2. Set up, import data, and prepare data
3.1 Initial Exploration of Time Series Data Set
3.2 Descriptive Statistics of Time Series Data Frame
3.3 Finding the columns having the NULL values
3.4 Total Inside Sales , everything that isn't made at the store
3.6 Combined Average Sales (Inside + Food Service) by Site ID
3.7 Average Diesel Sales by Site_ID
3.8 Average Unleaded Sales by Site_ID
3.9 Distribution of Sales (Inside + FoodService)
3.10 Distribution of Fuel Sales (Diesel + Unleaded)
3.11 Correlation matrix of all the 4 target variables
3.12 Average Inside Sales by Day of the Week
3.13 Average Food Service Sales by Day of the Week
3.14 Combined Average Sales by Day of the Week
3.15 Average Diesel & Unleaded sales by Day of the Week
3.17 Weekly Sales (Inside + Food Service)
3.18 Monthly Aggregated Plot - Total inside sales and Food Service sales
4.1 Set up, import data, and prepare Quantitative data
4.2 Finding Columns having Null values
4.4 Finding variability of columns
4.6 Separate variables into categorical and continuous
4.7 Visualizing Qualitative Features
4.8 Label encoding for converting Categorical variables to Numerical variables
5. Merging the time series dataset with the qualitative dataset
6.1 First Year Sales Analysis (Inside + Food Service)
6.2 First Year Fuel Sales Analysis (Diesel + Unleaded)
6.3 First Year Sales Visualizations
7. Top Performing store (First-Year non-fuel sales)
7.1 Time-Series Decomposition for the Top-performing store
7.2 Autocorrelation for Top Store
7.4 Time Series Decomposition of Top 5 Stores with the highest First-Year Non-fuel Sales
8. Top Performing Store (First-Year Fuel Sales)
8.1 Time-Series Decomposition for the Top-performing store
8.2 Time Series Decomposition of Top 5 Stores with the highest First-Year Fuel Sales
Maverik, a high-growth retail firm planning to open 30 new stores annually. Accurate sales forecasts for the first year of these new stores are crucial for effective financial planning and resource allocation. Leveraging both qualitative and time series data, we will employ various machine learning algorithms to predict daily sales metrics. Our objective is to achieve forecasts that closely match the actual Return on Investment (ROI) and provide accurate sales metrics. The project focuses on generating daily-level forecasts, incorporating qualitative insights and network-wide seasonality patterns, to support Maverik's strategic decision-making.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
import os
os.chdir('/content/drive/MyDrive/maverik-capstone') # changing the default directory
df_tsdm = pd.read_csv("time_series_data_msba.csv")
df_tsdm.head(20)
| Unnamed: 0 | capital_projects.soft_opening_date | calendar.calendar_day_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6/14/2022 | 6/17/2022 | 25 | Friday | NONE | WEEKDAY | 2168.2920 | 861.6930 | 722.7745 | 1425.1020 | 24535 |
| 1 | 2 | 6/14/2022 | 6/22/2022 | 25 | Wednesday | NONE | WEEKDAY | 2051.5635 | 808.0275 | 730.4850 | 1436.2740 | 24535 |
| 2 | 3 | 6/14/2022 | 6/23/2022 | 25 | Thursday | NONE | WEEKDAY | 2257.5000 | 966.4410 | 895.7970 | 1594.3725 | 24535 |
| 3 | 4 | 6/14/2022 | 6/26/2022 | 26 | Sunday | NONE | WEEKEND | 1520.5925 | 542.3250 | 584.2900 | 1124.9280 | 24535 |
| 4 | 5 | 6/14/2022 | 6/27/2022 | 26 | Monday | NONE | WEEKDAY | 1897.6930 | 771.4525 | 852.2605 | 1640.2540 | 24535 |
| 5 | 6 | 6/14/2022 | 6/30/2022 | 26 | Thursday | NONE | WEEKDAY | 2377.8965 | 878.6750 | 966.6860 | 1890.6510 | 24535 |
| 6 | 7 | 6/14/2022 | 7/1/2022 | 27 | Friday | NONE | WEEKDAY | 2696.2250 | 943.3095 | 894.9325 | 2202.2665 | 24535 |
| 7 | 8 | 6/14/2022 | 7/4/2022 | 27 | Monday | Independence Day | WEEKDAY | 2100.8750 | 686.1155 | 539.8400 | 1337.3815 | 24535 |
| 8 | 9 | 6/14/2022 | 7/6/2022 | 27 | Wednesday | NONE | WEEKDAY | 2402.5610 | 872.0180 | 994.6825 | 1993.9115 | 24535 |
| 9 | 10 | 6/14/2022 | 7/8/2022 | 28 | Friday | NONE | WEEKDAY | 2777.7190 | 1004.1465 | 1231.6675 | 2461.2700 | 24535 |
| 10 | 11 | 6/14/2022 | 7/9/2022 | 28 | Saturday | NONE | WEEKEND | 2227.2495 | 797.3770 | 869.2880 | 1719.4450 | 24535 |
| 11 | 12 | 6/14/2022 | 7/10/2022 | 28 | Sunday | NONE | WEEKEND | 2045.0955 | 697.8090 | 658.9170 | 1600.5080 | 24535 |
| 12 | 13 | 6/14/2022 | 7/12/2022 | 28 | Tuesday | NONE | WEEKDAY | 2544.9095 | 1046.7135 | 960.0990 | 2080.7045 | 24535 |
| 13 | 14 | 6/14/2022 | 7/15/2022 | 29 | Friday | NONE | WEEKDAY | 3193.7290 | 1308.3945 | 1554.9415 | 2724.9740 | 24535 |
| 14 | 15 | 6/14/2022 | 7/20/2022 | 29 | Wednesday | NONE | WEEKDAY | 2983.2075 | 1164.9120 | 1411.6445 | 2652.4400 | 24535 |
| 15 | 16 | 6/14/2022 | 8/6/2022 | 32 | Saturday | NONE | WEEKEND | 2892.3580 | 977.9770 | 1060.6960 | 3037.0200 | 24535 |
| 16 | 17 | 6/14/2022 | 8/9/2022 | 32 | Tuesday | NONE | WEEKDAY | 3147.4065 | 1278.9525 | 1341.1475 | 3681.8880 | 24535 |
| 17 | 18 | 6/14/2022 | 8/13/2022 | 33 | Saturday | NONE | WEEKEND | 2893.6285 | 953.0290 | 914.5185 | 3233.2790 | 24535 |
| 18 | 19 | 6/14/2022 | 8/14/2022 | 33 | Sunday | NONE | WEEKEND | 2726.1780 | 940.7825 | 915.2605 | 2802.6320 | 24535 |
| 19 | 20 | 6/14/2022 | 8/16/2022 | 33 | Tuesday | NONE | WEEKDAY | 3282.8145 | 1138.9035 | 1924.7410 | 3928.3790 | 24535 |
df_tsdm.describe()
| Unnamed: 0 | calendar.fiscal_week_id_for_year | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | site_id_msba | |
|---|---|---|---|---|---|---|---|
| count | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 |
| mean | 6954.500000 | 26.501079 | 2846.537988 | 759.922326 | 1702.585227 | 2382.091588 | 23041.052632 |
| std | 4015.038107 | 14.998715 | 981.299870 | 341.578220 | 2161.208192 | 1025.518658 | 710.634218 |
| min | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 240.180500 | 21560.000000 |
| 25% | 3477.750000 | 14.000000 | 2181.156250 | 521.087875 | 383.062750 | 1654.149000 | 22540.000000 |
| 50% | 6954.500000 | 26.000000 | 2693.976250 | 697.434500 | 1018.920000 | 2256.677500 | 22907.500000 |
| 75% | 10431.250000 | 39.000000 | 3325.306250 | 924.282625 | 2283.297625 | 2928.254000 | 23555.000000 |
| max | 13908.000000 | 52.000000 | 7172.466000 | 2531.662000 | 20853.952000 | 8077.233500 | 24535.000000 |
df_tsdm.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 13908 entries, 0 to 13907 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 13908 non-null int64 1 capital_projects.soft_opening_date 13908 non-null object 2 calendar.calendar_day_date 13908 non-null object 3 calendar.fiscal_week_id_for_year 13908 non-null int64 4 calendar.day_of_week 13908 non-null object 5 calendar_information.holiday 13908 non-null object 6 calendar_information.type_of_day 13908 non-null object 7 daily_yoy_ndt.total_inside_sales 13908 non-null float64 8 daily_yoy_ndt.total_food_service 13908 non-null float64 9 diesel 13908 non-null float64 10 unleaded 13908 non-null float64 11 site_id_msba 13908 non-null int64 dtypes: float64(4), int64(3), object(5) memory usage: 1.3+ MB
# Check for null values in the Time Series DataFrame
null_values_time_series = df_tsdm.isnull().sum()
# Display the columns with null values and their counts
print("Null Values in Time Series DataFrame:")
print(null_values_time_series[null_values_time_series > 0])
Null Values in Time Series DataFrame: Series([], dtype: int64)
# Group the data by 'site_id' and calculate the maximum and minimum inside sales
inside_sales_summary = df_tsdm.groupby('site_id_msba')['daily_yoy_ndt.total_inside_sales'].agg(['max', 'min'])
# Reset the index to make 'site_id' a column instead of an index
inside_sales_summary = inside_sales_summary.reset_index()
# Rename the columns for clarity
inside_sales_summary.columns = ['site_id', 'max_inside_sales', 'min_inside_sales']
# Display the summary
print(inside_sales_summary)
site_id max_inside_sales min_inside_sales 0 21560 5975.6060 1339.6110 1 21980 6191.9060 1094.6740 2 22015 4871.8845 1170.4525 3 22085 7172.4660 1454.4460 4 22120 3849.2125 693.1925 5 22260 4692.3625 1207.9620 6 22330 3690.6800 165.7285 7 22400 4055.4920 0.0000 8 22505 3328.3390 449.0150 9 22540 4663.4350 1002.6240 10 22575 3794.0000 867.7970 11 22645 3346.2660 968.9645 12 22680 5710.0855 1131.3715 13 22715 4399.4545 825.9055 14 22750 6579.7865 1939.8330 15 22785 4282.6840 1740.1405 16 22820 3879.5400 1069.2465 17 22855 4811.7370 1165.8115 18 22890 4269.7655 655.8300 19 22925 3913.6825 1442.9975 20 23065 4264.6415 973.5880 21 23135 3394.3945 1006.6350 22 23240 4309.8160 886.3225 23 23345 5741.7500 1250.8090 24 23380 5895.5750 1553.1915 25 23415 6061.3595 524.1810 26 23450 5153.3440 1157.4745 27 23485 3021.8615 752.6085 28 23555 3753.8830 917.3535 29 23660 5943.5950 16.7545 30 23730 5514.4670 1686.8880 31 23765 4375.6615 1581.8320 32 23835 3860.3180 1340.1150 33 23905 3566.1675 821.5935 34 24150 5578.0900 844.8195 35 24220 5449.6400 1689.1385 36 24255 4558.6765 970.3645 37 24535 4405.1665 1257.7530
daily_sales_max = df_tsdm.groupby('site_id_msba')['daily_yoy_ndt.total_inside_sales'].max()
# Calculate the maximum daily inside sales for each site_id
daily_sales_max = df_tsdm.groupby('site_id_msba')['daily_yoy_ndt.total_inside_sales'].max()
# Find the site_id with the highest and lowest maximum daily inside sales
site_id_highest_sales = daily_sales_max.idxmax()
site_id_lowest_sales = daily_sales_max.idxmin()
# Display the results
print(f"Site ID with the highest maximum daily inside sales: {site_id_highest_sales}")
print(f"Site ID with the lowest maximum daily inside sales: {site_id_lowest_sales}")
Site ID with the highest maximum daily inside sales: 22085 Site ID with the lowest maximum daily inside sales: 23485
df_tsdm.describe()
| Unnamed: 0 | calendar.fiscal_week_id_for_year | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | site_id_msba | |
|---|---|---|---|---|---|---|---|
| count | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 |
| mean | 6954.500000 | 26.501079 | 2846.537988 | 759.922326 | 1702.585227 | 2382.091588 | 23041.052632 |
| std | 4015.038107 | 14.998715 | 981.299870 | 341.578220 | 2161.208192 | 1025.518658 | 710.634218 |
| min | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 240.180500 | 21560.000000 |
| 25% | 3477.750000 | 14.000000 | 2181.156250 | 521.087875 | 383.062750 | 1654.149000 | 22540.000000 |
| 50% | 6954.500000 | 26.000000 | 2693.976250 | 697.434500 | 1018.920000 | 2256.677500 | 22907.500000 |
| 75% | 10431.250000 | 39.000000 | 3325.306250 | 924.282625 | 2283.297625 | 2928.254000 | 23555.000000 |
| max | 13908.000000 | 52.000000 | 7172.466000 | 2531.662000 | 20853.952000 | 8077.233500 | 24535.000000 |
# Check for missing values in the entire DataFrame
null_values = df_tsdm.isnull().sum()
#Calculate the percentage of missing values
total_values = df_tsdm.shape[0] # Total number of rows
percentage_missing = (null_values / total_values) * 100
# Create a DataFrame to display the results
missing_data_info = pd.DataFrame({
'Column': null_values.index,
'Missing Values': null_values.values,
'Percentage Missing': percentage_missing.values
})
# Filter for columns with missing values
columns_with_missing = missing_data_info[missing_data_info['Missing Values'] > 0]
# Display columns with missing values
print("Columns with Missing Values:")
print(columns_with_missing)
Columns with Missing Values: Empty DataFrame Columns: [Column, Missing Values, Percentage Missing] Index: []
# Group data by 'site_id_msba' and calculate the average inside sales and food service sales for each site_id
average_inside_sales = df_tsdm.groupby('site_id_msba')['daily_yoy_ndt.total_inside_sales'].mean()
average_food_service_sales = df_tsdm.groupby('site_id_msba')['daily_yoy_ndt.total_food_service'].mean()
# Combine the average values by summing them
combined_average_sales = average_inside_sales + average_food_service_sales
# Sort the data in descending order of combined average sales
combined_average_sales = combined_average_sales.sort_values(ascending=False)
# Define a color palette for the bars
colors = sns.color_palette('colorblind', len(combined_average_sales))
# Create a bar plot for combined average sales by site_id with different colors
plt.figure(figsize=(15, 6))
ax = combined_average_sales.plot(kind='bar', color=colors)
plt.xlabel('Site ID')
plt.ylabel('Combined Average Sales')
plt.title('Combined Average Sales (Inside + Food Service) by Site ID (Descending Order)')
plt.xticks(rotation=90) # Keep x-axis labels horizontal
# Add y-axis values with some space at the top of each bar
y_offset = 50 # Adjust this value to control the spacing
for i, v in enumerate(combined_average_sales):
ax.text(i, v + y_offset, str(round(v, 2)), ha='center', va='bottom', fontsize=9, rotation='vertical')
# Adjust y-axis limits to provide space at the top
plt.ylim(bottom=0, top=max(combined_average_sales) + y_offset * 2)
plt.tight_layout()
# Show the combined bar plot
plt.show()
Site ID 22085 has the Highest Combined Sales of all the sites where as Site ID 23485 has the lowest combined sales. Top 5 site_ids having highest sales are 22085, 22750, 21980, 23415, 23660
# Group data by 'site_id_msba' and calculate the average diesel sales for each site_id
average_diesel_sales = df_tsdm.groupby('site_id_msba')['diesel'].mean()
# Sort the data in descending order of average diesel sales
average_diesel_sales = average_diesel_sales.sort_values(ascending=False)
# Define a color palette for the bars
colors = sns.color_palette('colorblind', len(average_diesel_sales))
# Create a bar plot for average diesel sales by site_id with different colors
plt.figure(figsize=(15, 6))
ax = average_diesel_sales.plot(kind='bar', color=colors)
plt.xlabel('Site ID')
plt.ylabel('Average Diesel Sales')
plt.title('Average Diesel Sales by Site ID (Descending Order)')
plt.xticks(rotation=90) # Keep x-axis labels horizontal
# Add y-axis values with some space at the top of each bar
y_offset = 50 # Adjust this value to control the spacing
for i, v in enumerate(average_diesel_sales):
ax.text(i, v + y_offset, str(round(v, 2)), ha='center', va='bottom', fontsize=9, rotation='vertical')
# Adjust y-axis limits to provide space at the top
plt.ylim(bottom=0, top=max(average_diesel_sales) + y_offset * 2)
plt.tight_layout()
# Show the bar plot for diesel sales
plt.show()
21980 has the highest Diesel Sales
# Group data by 'site_id_msba' and calculate the average unleaded sales for each site_id
average_unleaded_sales = df_tsdm.groupby('site_id_msba')['unleaded'].mean()
# Sort the data in descending order of average unleaded sales
average_unleaded_sales = average_unleaded_sales.sort_values(ascending=False)
# Define a color palette for the bars
colors = sns.color_palette('colorblind', len(average_unleaded_sales))
# Create a bar plot for average unleaded sales by site_id with different colors
plt.figure(figsize=(15, 6))
ax = average_unleaded_sales.plot(kind='bar', color=colors)
plt.xlabel('Site ID')
plt.ylabel('Average Unleaded Sales')
plt.title('Average Unleaded Sales by Site ID (Descending Order)')
plt.xticks(rotation=90) # Keep x-axis labels horizontal
# Add y-axis values with some space at the top of each bar
y_offset = 50 # Adjust this value to control the spacing
for i, v in enumerate(average_unleaded_sales):
ax.text(i, v + y_offset, str(round(v, 2)), ha='center', va='bottom', fontsize=9, rotation='vertical')
# Adjust y-axis limits to provide space at the top
plt.ylim(bottom=0, top=max(average_unleaded_sales) + y_offset * 2)
plt.tight_layout()
# Show the bar plot for unleaded sales
plt.show()
sns.set(style="whitegrid")
sales_columns = ['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']
df_tsdm['total_daily_sales'] = df_tsdm[sales_columns].sum(axis=1)
fig, axes = plt.subplots(3,1, figsize=(18, 12))
fig.suptitle('Histograms of Sales-Related Features')
for i, col in enumerate(sales_columns + ['total_daily_sales']):
sns.histplot(df_tsdm[col], bins=50, kde=True, ax=axes[i])
axes[i].set_title(f'Distribution of {col}')
axes[i].set_xlabel(col)
axes[i].set_ylabel('Frequency')
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
Daily Inside Sales: Appears to be right-skewed, with most stores having lower daily inside sales.
Daily Food Service: This is also right-skewed; the majority of stores have lower daily food service sales.
Total Sales: The aggregated sales show a right-skewed distribution.
These skewnesses suggest that outlier detection and possibly transformation may be necessary for modeling.
fuel_features = ['diesel', 'unleaded']
df_tsdm['total_daily_fuel_sales'] = df_tsdm[fuel_features].sum(axis=1)
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
fig.suptitle('Distribution of Fuel-related sales')
for i, col in enumerate(fuel_features + ['total_daily_fuel_sales']):
sns.histplot(df_tsdm[col], bins=50, kde=True, ax=axes[i])
axes[i].set_title(f'Distribution of {col}')
axes[i].set_xlabel(col + '(in Gallons)')
axes[i].set_ylabel('Frequency')
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
Diesel: Exhibits a right-skewed distribution similar to the previous variables.
Unleaded: Slightly right-skewed but relatively more evenly distributed compared to other variables.
Total Fuel Sales: Exhibits right-skewed distribution which is inline with diesel and unleaded sales
variables = ['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded']
correlation_matrix = df_tsdm[variables].corr()
# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
# Convert 'calendar.calendar_day_date' to a datetime object if it's not already
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
# Extract the day of the week as an integer (0=Monday, 1=Tuesday, ..., 6=Sunday)
df_tsdm['calendar.day_of_week'] = df_tsdm['calendar.calendar_day_date'].dt.weekday
# Group the data by 'calendar.day_of_week' and calculate the average inside sales for each day of the week
average_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['daily_yoy_ndt.total_inside_sales'].mean()
# Define the names of days of the week
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Sort the days of the week in descending order of average sales
sorted_days_of_week = [day for _, day in sorted(zip(average_sales_by_day_of_week, days_of_week), reverse=True)]
sorted_sales = sorted(average_sales_by_day_of_week, reverse=True)
# Create a list of unique colors for each day
unique_colors = plt.cm.viridis(np.linspace(0, 1, len(sorted_days_of_week)))
# Create a bar plot with different colors for each day and sorted in descending order
plt.figure(figsize=(10, 6))
bars = plt.bar(sorted_days_of_week, sorted_sales, color=unique_colors)
plt.title('Average Inside Sales by Day of the Week (Descending Order)')
plt.xlabel('Day of the Week')
plt.ylabel('Average Inside Sales')
# Annotate each bar with its respective y-axis value
for bar, value in zip(bars, sorted_sales):
plt.text(bar.get_x() + bar.get_width() / 2 - 0.15, bar.get_height() + 50, str(round(value, 2)), ha='center', color='black', fontsize=10)
plt.tight_layout()
plt.show()
Friday is the day of the week where the Average Inside sales are highest. We can see a gradual increase in average sales from Monday to Friday with the sales dipping on the weekends Saturday and Sunday.
# Convert 'calendar.calendar_day_date' to a datetime object if it's not already
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
# Extract the day of the week as an integer (0=Monday, 1=Tuesday, ..., 6=Sunday)
df_tsdm['calendar.day_of_week'] = df_tsdm['calendar.calendar_day_date'].dt.weekday
# Group the data by 'calendar.day_of_week' and calculate the average food service sales for each day of the week
average_food_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['daily_yoy_ndt.total_food_service'].mean()
# Define the names of days of the week
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Sort the days of the week in descending order of average food service sales
sorted_days_of_week = [day for _, day in sorted(zip(average_food_sales_by_day_of_week, days_of_week), reverse=True)]
sorted_food_sales = sorted(average_food_sales_by_day_of_week, reverse=True)
# Create a list of unique colors for each day
unique_colors = plt.cm.viridis(np.linspace(0, 1, len(sorted_days_of_week)))
# Create a bar plot with different colors for each day and sorted in descending order
plt.figure(figsize=(10, 6))
bars = plt.bar(sorted_days_of_week, sorted_food_sales, color=unique_colors)
plt.title('Average Food Service Sales by Day of the Week (Descending Order)')
plt.xlabel('Day of the Week')
plt.ylabel('Average Food Service Sales')
# Annotate each bar with its respective y-axis value
for bar, value in zip(bars, sorted_food_sales):
plt.text(bar.get_x() + bar.get_width() / 2 - 0.15, bar.get_height() + 20, str(round(value, 2)), ha='center', color='black', fontsize=10)
plt.tight_layout()
plt.show()
Friday is the day of the week where the Average food sales are highest
# Convert 'calendar.calendar_day_date' to a datetime object if it's not already
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
# Extract the day of the week as an integer (0=Monday, 1=Tuesday, ..., 6=Sunday)
df_tsdm['calendar.day_of_week'] = df_tsdm['calendar.calendar_day_date'].dt.weekday
# Define the names of days of the week starting from Monday
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Group the data by 'calendar.day_of_week' and calculate the average inside sales and food service sales for each day of the week
average_inside_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['daily_yoy_ndt.total_inside_sales'].mean()
average_food_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['daily_yoy_ndt.total_food_service'].mean()
# Combine inside sales and food service sales for each day of the week
combined_sales = average_inside_sales_by_day_of_week + average_food_sales_by_day_of_week
# Create a list of unique colors for each day
unique_colors = plt.cm.viridis(np.linspace(0, 1, len(days_of_week)))
# Create a bar plot with different colors for each day
plt.figure(figsize=(12, 6))
bars = plt.bar(days_of_week, combined_sales, color=unique_colors)
plt.title('Combined Average Sales by Day of the Week (Starting from Monday)')
plt.xlabel('Day of the Week')
plt.ylabel('Combined Average Sales')
# Annotate each bar with its respective y-axis value
for bar, value in zip(bars, combined_sales):
plt.text(bar.get_x() + bar.get_width() / 2 - 0.15, bar.get_height() + 50, str(round(value, 2)), ha='center', color='black', fontsize=10)
plt.tight_layout()
plt.show()
Friday Records the Highest sales. There is a gradual increase from Monday to Friday and then a dip on Weekends
# Convert 'calendar.calendar_day_date' to a datetime object if it's not already
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
# Extract the day of the week as an integer (0=Monday, 1=Tuesday, ..., 6=Sunday)
df_tsdm['calendar.day_of_week'] = df_tsdm['calendar.calendar_day_date'].dt.weekday
# Group the data by 'calendar.day_of_week' and calculate the average diesel and unleaded sales for each day of the week
average_diesel_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['diesel'].mean()
average_unleaded_sales_by_day_of_week = df_tsdm.groupby('calendar.day_of_week')['unleaded'].mean()
# Define the names of days of the week
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Sort the days of the week in the order they appear (no sorting needed)
sorted_days_of_week = days_of_week
# Create a list of unique colors for each day
unique_colors = plt.cm.viridis(np.linspace(0, 1, len(sorted_days_of_week)))
# Create separate bar plots for diesel and unleaded sales, using the same x-axis and y-axis
plt.figure(figsize=(10, 6))
# Plot Diesel Sales
plt.bar(sorted_days_of_week, average_diesel_sales_by_day_of_week, color=unique_colors, label='Diesel')
plt.title('Average Diesel Sales by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Diesel Sales')
plt.legend()
plt.tight_layout()
# Show the first bar plot
plt.show()
# Create a new figure for Unleaded Sales
plt.figure(figsize=(10, 6))
# Plot Unleaded Sales
plt.bar(sorted_days_of_week, average_unleaded_sales_by_day_of_week, color=unique_colors, label='Unleaded')
plt.title('Average Unleaded Sales by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Unleaded Sales')
plt.legend()
plt.tight_layout()
# Show the second bar plot
plt.show()
Diesel sales are highest on Wednesday and Unleaded sales are highest on Friday
# Extract month and year from the date column
df_tsdm['Date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
# Create a time series slider chart for daily sales
fig = px.line(df_tsdm, x='Date', y=[ 'diesel', 'unleaded'],
labels={'Date': 'Date', 'value': 'Sales'},
title='Time Series Slider Chart of Daily Sales- Diesel & Unleaded',
width=1200, height=600)
# Add a slider for selecting the date range
fig.update_xaxes(rangeslider_visible=True)
# Show the plot
fig.show()
Time Series slider chart of Fuel Sales
# Extract date, year, and month from the date column
df_tsdm['Date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
df_tsdm['Year'] = df_tsdm['Date'].dt.year
df_tsdm['Month'] = df_tsdm['Date'].dt.strftime('%Y-%m-%d')
# Group the data by date and calculate the sum of sales columns
grouped = df_tsdm.groupby(['Year', 'Month'])[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded']].sum().reset_index()
# Create a vertical bar chart for daily sales
fig1 = go.Figure(data=[
go.Bar(name='Inside Sales', x=grouped['Month'], y=grouped['daily_yoy_ndt.total_inside_sales'], marker_color='blue'),
go.Bar(name='Food Service', x=grouped['Month'], y=grouped['daily_yoy_ndt.total_food_service'], marker_color='green'),
go.Bar(name='Diesel', x=grouped['Month'], y=grouped['diesel'], marker_color='orange'),
go.Bar(name='Unleaded', x=grouped['Month'], y=grouped['unleaded'], marker_color='red')
])
# Add a time series slider chart for daily sales
fig2 = px.line(grouped, x='Month', y=['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded'],
labels={'Month': 'Date', 'value': 'Sales'},
title='Time Series Slider Chart of Daily Sales',
width=800, height=400, range_x=[grouped['Month'].min(), grouped['Month'].max()])
# Combine the two figures into subplots
fig = sp.make_subplots(rows=2, cols=1, shared_xaxes=True)
# Add the vertical bar chart to the first subplot
fig.add_trace(fig1.data[0], row=1, col=1)
fig.add_trace(fig1.data[1], row=1, col=1)
fig.add_trace(fig1.data[2], row=1, col=1)
fig.add_trace(fig1.data[3], row=1, col=1)
# Add the time series slider chart to the second subplot
fig.add_trace(fig2.data[0], row=2, col=1)
fig.add_trace(fig2.data[1], row=2, col=1)
fig.add_trace(fig2.data[2], row=2, col=1)
fig.add_trace(fig2.data[3], row=2, col=1)
# Update the layout of the combined figure
fig.update_layout(
title='Combined Vertical Bar Chart and Time Series Slider Chart',
xaxis_title='Date',
yaxis_title='Sales',
xaxis2=dict(rangeslider=dict(visible=True)),
yaxis2=dict(title='Sales'),
)
# Show the combined figure
fig.show()
Time Series slider chart of all Target Variables
Total Inside Sales - each week
# Convert the 'calendar.calendar_day_date' column to a datetime format
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
# Extract the fiscal year and fiscal week from the date column
df_tsdm['Fiscal_Year'] = df_tsdm['calendar.calendar_day_date'].dt.year
df_tsdm['Fiscal_Week'] = df_tsdm['calendar.calendar_day_date'].dt.strftime('%U')
# Group the data by fiscal year and fiscal week, then calculate the total inside sales for each week
weekly_inside_sales = df_tsdm.groupby(['Fiscal_Year', 'Fiscal_Week'])['daily_yoy_ndt.total_inside_sales'].sum()
# Create a line plot of total inside sales by fiscal week
plt.figure(figsize=(12, 6))
# Specify custom colors for each year
colors = {
2021: 'blue',
2022: 'green',
2023: 'red'
}
# Plot the data for each fiscal year separately with different colors
for year in df_tsdm['Fiscal_Year'].unique():
data = weekly_inside_sales[year].reset_index()
plt.plot(data['Fiscal_Week'], data['daily_yoy_ndt.total_inside_sales'], marker='o', linestyle='-', label=f'Year {year}', color=colors[year])
plt.title('Total Inside Sales by Fiscal Week')
plt.xlabel('Fiscal Week')
plt.ylabel('Total Inside Sales')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
For year 2021, Week 51 has the highest Inside Sales For year 2022, Week 19 has the highest Inside Sales For year 2023, Week 17 has the highest Inside Sales
Total Food Service Sales - each week
# Convert the 'calendar.calendar_day_date' column to a datetime format
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
# Extract the fiscal year and fiscal week from the date column
df_tsdm['Fiscal_Year'] = df_tsdm['calendar.calendar_day_date'].dt.year
df_tsdm['Fiscal_Week'] = df_tsdm['calendar.calendar_day_date'].dt.strftime('%U')
# Group the data by fiscal year and fiscal week, then calculate the total food service sales for each week
weekly_food_service_sales = df_tsdm.groupby(['Fiscal_Year', 'Fiscal_Week'])['daily_yoy_ndt.total_food_service'].sum()
# Create a line plot of total food service sales by fiscal week
plt.figure(figsize=(12, 6))
# Specify custom colors for each year
colors = {2021: 'blue', 2022: 'green', 2023: 'red'}
# Plot the data for each fiscal year separately
for year in df_tsdm['Fiscal_Year'].unique():
data = weekly_food_service_sales[year].reset_index()
plt.plot(data['Fiscal_Week'], data['daily_yoy_ndt.total_food_service'], marker='o', linestyle='-', label=f'Year {year}', color=colors[year])
plt.title('Total Food Service Sales by Fiscal Week')
plt.xlabel('Fiscal Week')
plt.ylabel('Total Food Service Sales')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
For year 2021, Week 50 has the highest Food Service Sales For year 2022, Week 32 has the highest Food Service Sales For year 2023, Week 17 has the highest Food Service Sales
Total Combined Sales - each week
# Convert the 'calendar.calendar_day_date' column to a datetime format
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
# Extract the fiscal year and fiscal week from the date column
df_tsdm['Fiscal_Year'] = df_tsdm['calendar.calendar_day_date'].dt.year
df_tsdm['Fiscal_Week'] = df_tsdm['calendar.calendar_day_date'].dt.strftime('%U')
# Group the data by fiscal year and fiscal week, then calculate the total sales for each week
weekly_total_sales = df_tsdm.groupby(['Fiscal_Year', 'Fiscal_Week'])[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']].sum()
# Create a line plot for each fiscal year with specified colors
plt.figure(figsize=(12, 6))
colors = ['b', 'g', 'r'] # Specify custom colors for each year
x_values = [] # Initialize x_values list to accumulate x values
for i, year in enumerate(weekly_total_sales.index.levels[0]):
data_year = weekly_total_sales.loc[year]
data_year['Combined_Total_Sales'] = data_year['daily_yoy_ndt.total_inside_sales'] + data_year['daily_yoy_ndt.total_food_service']
# Plot the combined total sales for the year with the specified color
x_year = list(range(len(data_year)))
plt.plot(x_year, data_year['Combined_Total_Sales'], marker='o', linestyle='-', label=f'Year {year}', color=colors[i])
# Accumulate x values
x_values += x_year
plt.title('Combined Total Sales by Fiscal Week (Each Year Separately)')
plt.xlabel('Week Number')
plt.ylabel('Combined Total Sales')
plt.legend()
plt.grid(True)
# Set the tick locations
plt.xticks(x_values)
plt.tight_layout()
plt.show()
For year 2021, Week 48 has the highest Combined Sales For year 2022, Week 19 has the highest Combined Sales For year 2023, Week 16 has the highest Combined Sales
# Convert the 'calendar.calendar_day_date' column to datetime if it's not already in datetime format
df_tsdm['calendar.calendar_day_date'] = pd.to_datetime(df_tsdm['calendar.calendar_day_date'])
# Create date-related features
df_tsdm['year'] = df_tsdm['calendar.calendar_day_date'].dt.year
df_tsdm['month'] = df_tsdm['calendar.calendar_day_date'].dt.month
df_tsdm['day'] = df_tsdm['calendar.calendar_day_date'].dt.day
df_tsdm['quarter'] = df_tsdm['calendar.calendar_day_date'].dt.quarter
df_tsdm['day_of_week'] = df_tsdm['calendar.calendar_day_date'].dt.dayofweek # 0=Monday, 1=Tuesday, ..., 6=Sunday
df_tsdm['is_weekend'] = (df_tsdm['day_of_week'] >= 5).astype(int) # 1 if it's a weekend, 0 otherwise
# Print the modified DataFrame with new features
print(df_tsdm.head())
Unnamed: 0 capital_projects.soft_opening_date calendar.calendar_day_date \
0 1 6/14/2022 2022-06-17
1 2 6/14/2022 2022-06-22
2 3 6/14/2022 2022-06-23
3 4 6/14/2022 2022-06-26
4 5 6/14/2022 2022-06-27
calendar.fiscal_week_id_for_year calendar.day_of_week \
0 25 4
1 25 2
2 25 3
3 26 6
4 26 0
calendar_information.holiday calendar_information.type_of_day \
0 NONE WEEKDAY
1 NONE WEEKDAY
2 NONE WEEKDAY
3 NONE WEEKEND
4 NONE WEEKDAY
daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service \
0 2168.2920 861.6930
1 2051.5635 808.0275
2 2257.5000 966.4410
3 1520.5925 542.3250
4 1897.6930 771.4525
diesel ... Year Month Fiscal_Year Fiscal_Week year month day \
0 722.7745 ... 2022 2022-06-17 2022 24 2022 6 17
1 730.4850 ... 2022 2022-06-22 2022 25 2022 6 22
2 895.7970 ... 2022 2022-06-23 2022 25 2022 6 23
3 584.2900 ... 2022 2022-06-26 2022 26 2022 6 26
4 852.2605 ... 2022 2022-06-27 2022 26 2022 6 27
quarter day_of_week is_weekend
0 2 4 0
1 2 2 0
2 2 3 0
3 2 6 1
4 2 0 0
[5 rows x 25 columns]
#Create a stacked bar chart for monthly total inside sales and total food service sales in millions of USD
monthly_sales = df_tsdm.groupby(['year', 'month'])[['daily_yoy_ndt.total_food_service','daily_yoy_ndt.total_inside_sales']].sum() / 1_000_000 # Divide by 1 million
plt.figure(figsize=(12, 6))
# Specify custom colors for the bars
colors = ['green', 'orange']
ax = monthly_sales.plot(kind='bar', stacked=True, color=colors, figsize=(12, 6))
plt.title('Monthly Sales (Stacked)')
plt.xlabel('Month and Year')
plt.ylabel('Sales (in millions USD)')
plt.grid(axis='y')
# Annotate data values on top of each bar segment with exact coordinates
for p in ax.patches:
width, height = p.get_width(), p.get_height()
x, y = p.get_xy()
if height > 0.2: # Only annotate bars with significant height
ax.annotate(f'{height:.2f}M', (x + width/2, y + height + 0.02), ha='center', va='center', fontsize=10, color='black')
# Set custom x-labels using month and year values from the DataFrame
custom_labels = [f'{int(x[1])}, {int(x[0])}' for x in monthly_sales.index]
ax.set_xticks(range(len(custom_labels)))
ax.set_xticklabels(custom_labels, rotation=45, ha='right')
plt.legend(['Inside Sales', 'Food Service Sales'], loc='upper right')
plt.tight_layout()
plt.show()
<Figure size 1200x600 with 0 Axes>
May 2022 has the highest Combined Sales
Seasonal Plot: Plot seasonal patterns by aggregating data over seasons or quarters.
#Plot quarterly total inside sales
quarterly_sales = df_tsdm.groupby(['year', 'quarter'])['daily_yoy_ndt.total_inside_sales'].sum()
plt.figure(figsize=(12, 6))
quarterly_sales.plot(kind='bar', rot=0)
plt.title('Quarterly Total Inside Sales')
plt.xlabel('Quarter')
plt.ylabel('Total Inside Sales')
plt.grid(axis='y')
plt.show()
During 2nd,3rd quarters of Year 2022, sales were found to be highest
Box Plot by Day of the Week: Visualize how a feature varies by the day of the week.
#Box plot of inside sales by day of the week
plt.figure(figsize=(10, 6))
plt.boxplot([df_tsdm[df_tsdm['day_of_week'] == i]['daily_yoy_ndt.total_inside_sales'] for i in range(7)],
labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.title('Inside Sales by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Inside Sales')
plt.grid(axis='y')
plt.show()
Combine Sales of Total Inside and Total Food Service
# Group the data by year and calculate the sum of inside sales and food service sales for each year
yearly_sales = df_tsdm.groupby(df_tsdm['calendar.calendar_day_date'].dt.year)[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']].sum()
# Create a new column for the combined sales
yearly_sales['total_combined_sales'] = yearly_sales['daily_yoy_ndt.total_inside_sales'] + yearly_sales['daily_yoy_ndt.total_food_service']
# Extract years and sales values
years = yearly_sales.index
sales_values = yearly_sales['total_combined_sales']
# Create a bar plot with different colors
colors = ['blue', 'green', 'red']
plt.figure(figsize=(12, 6))
bars = plt.bar(years, sales_values, color=colors)
# Add truncated values on top of each bar
for bar, value in zip(bars, sales_values):
truncated_value = round(value, 1) # Truncate to 1 decimal place
plt.text(bar.get_x() + bar.get_width() / 2, value, f'{truncated_value:.1f}', ha='center', va='bottom', fontsize=12)
plt.title('Combined Sales by Year')
plt.xlabel('Year')
plt.ylabel('Total Combined Sales')
plt.xticks(years)
plt.grid(axis='y')
plt.show()
# Define a color palette for day types
day_type_colors = {'WEEKDAY': 'ORANGE', 'WEEKEND': 'RED'}
# Frequency distribution of day types with different colors
day_type_counts = df_tsdm['calendar_information.type_of_day'].value_counts()
total_days = len(df_tsdm) # Total number of days
plt.figure(figsize=(8, 6))
ax = sns.barplot(x=day_type_counts.index, y=day_type_counts.values, palette=day_type_colors)
# Add count annotations above the bars
for p in ax.patches:
ax.annotate(f'{p.get_height()} ({(p.get_height() / total_days) * 100:.1f}%)',
(p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=12, color='black', xytext=(0, 10),
textcoords='offset points')
plt.title('Frequency Distribution of Day Types')
plt.xlabel('Day Type')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
71.6% of the days are Weekday and 28.4% of the days are Weekends
Box Plots by Day Type - Distribution of Sales
# Define a color palette for day types
day_type_colors = {'WEEKDAY': 'ORANGE', 'WEEKEND': 'RED'}
# Box plots of sales by day type with different colors
plt.figure(figsize=(10, 6))
sns.boxplot(x='calendar_information.type_of_day', y='daily_yoy_ndt.total_inside_sales', data=df_tsdm, palette=day_type_colors)
plt.title('Sales Distribution by Day Type')
plt.xlabel('Day Type')
plt.ylabel('Inside Sales')
plt.xticks(rotation=45)
plt.show()
df_qdm = pd.read_csv("qualitative_data_msba.csv")
num_columns = df_qdm.shape[1]
print("Number of columns in df_qdm:", num_columns)
Number of columns in df_qdm: 55
df_qdm.describe()
| Unnamed: 0 | open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | x1_mile_pop | x1_mile_emp | x1_mile_income | x1_2_mile_pop | ... | traditional_forecourt_fueling_positions | rv_lanes_fueling_positions | hi_flow_lanes_fueling_positions | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37.000000 | 37.000000 | 37.00000 | 37.0 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | ... | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 |
| mean | 19.000000 | 2021.324324 | 4970.27027 | 2.0 | 1.648649 | 37.405405 | 6703.567568 | 4757.648649 | 53300.378378 | 1833.108108 | ... | 14.270270 | 2.513514 | 3.324324 | 3.324324 | 2.513514 | 2.378378 | 2.351351 | 4.648649 | 1.702703 | 23040.405405 |
| std | 10.824355 | 0.474579 | 575.93121 | 0.0 | 0.483978 | 5.918237 | 5694.011350 | 4697.168291 | 24333.027254 | 1915.140476 | ... | 3.948619 | 2.049683 | 2.925501 | 2.925501 | 2.049683 | 0.923500 | 0.856875 | 1.751447 | 0.740303 | 730.069801 |
| min | 1.000000 | 2021.000000 | 2933.00000 | 2.0 | 1.000000 | 23.000000 | 0.000000 | 56.000000 | 0.000000 | 0.000000 | ... | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 21560.000000 |
| 25% | 10.000000 | 2021.000000 | 5046.00000 | 2.0 | 1.000000 | 34.000000 | 1984.000000 | 1771.000000 | 39538.000000 | 262.000000 | ... | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 4.000000 | 1.000000 | 22540.000000 |
| 50% | 19.000000 | 2021.000000 | 5046.00000 | 2.0 | 2.000000 | 38.000000 | 5574.000000 | 3895.000000 | 46356.000000 | 1003.000000 | ... | 12.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 2.000000 | 2.000000 | 4.000000 | 2.000000 | 22890.000000 |
| 75% | 28.000000 | 2022.000000 | 5046.00000 | 2.0 | 2.000000 | 41.000000 | 11269.000000 | 6002.000000 | 73519.000000 | 2686.000000 | ... | 16.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 3.000000 | 3.000000 | 6.000000 | 2.000000 | 23555.000000 |
| max | 37.000000 | 2022.000000 | 6134.00000 | 2.0 | 2.000000 | 49.000000 | 18692.000000 | 26077.000000 | 110957.000000 | 5923.000000 | ... | 24.000000 | 6.000000 | 9.000000 | 9.000000 | 6.000000 | 5.000000 | 5.000000 | 10.000000 | 4.000000 | 24535.000000 |
8 rows × 28 columns
# Check for null values in the Qualitative DataFrame
null_values_qualitative = df_qdm.isnull().sum()
# Display the columns with null values and their counts
print("Null Values in Qualitative DataFrame:")
print(null_values_qualitative[null_values_qualitative > 0])
Null Values in Qualitative DataFrame: rv_lanes_layout 14 rv_lanes_stack_type 14 hi_flow_lanes_layout 15 hi_flow_lanes_stack_type 15 hi_flow_rv_lanes_layout 14 hi_flow_rv_lanes_stack_type 14 dtype: int64
#Check for Unique values in the columns with Missing values
print('rv_lanes_layout Unique values: ', df_qdm['rv_lanes_layout'].unique())
print('rv_lanes_stack_type Unique values: ', df_qdm['rv_lanes_stack_type'].unique())
print('hi_flow_lanes_layout Unique values: ', df_qdm['hi_flow_lanes_layout'].unique())
print('hi_flow_lanes_stack_type Unique values: ', df_qdm['hi_flow_lanes_stack_type'].unique())
print('hi_flow_rv_lanes_layout Unique values: ', df_qdm['hi_flow_rv_lanes_layout'].unique())
print('hi_flow_rv_lanes_stack_type Unique values: ', df_qdm['hi_flow_rv_lanes_stack_type'].unique())
rv_lanes_layout Unique values: ['Stack' 'In-Line' nan] rv_lanes_stack_type Unique values: ['HF/RV' 'None' nan] hi_flow_lanes_layout Unique values: ['Stack' 'Combo' nan] hi_flow_lanes_stack_type Unique values: ['HF/RV' nan] hi_flow_rv_lanes_layout Unique values: ['Stack' 'Combo' 'In-Line' nan] hi_flow_rv_lanes_stack_type Unique values: ['HF/RV' 'None' nan]
Since the missing values are mostly categorical and can be interpreted as not present. We are imputing a new 'Not Present' Category in place of null values. Since the 'rv_lanes_stack_type' and 'hi_flow_rv_lanes_stack_type' already have the 'None' Category, we will impute the missing values in this columns with 'None' instead of 'Not Present'.
#Imputing Missing values with 'None' and 'Not Present' categories
df_qdm['rv_lanes_stack_type'] = df_qdm['rv_lanes_stack_type'].fillna('None')
df_qdm['hi_flow_rv_lanes_stack_type'] = df_qdm['hi_flow_rv_lanes_stack_type'].fillna('None')
df_qdm = df_qdm.fillna('Not Present')
df_qdm.isnull().sum()
Unnamed: 0 0 open_year 0 square_feet 0 front_door_count 0 years_since_last_project 0 parking_spaces 0 lottery 0 freal 0 bonfire_grill 0 pizza 0 cinnabon 0 godfather_s_pizza 0 ethanol_free 0 diesel 0 hi_flow_lanes 0 rv_lanes 0 hi_flow_rv_lanes 0 def 0 cat_scales 0 car_wash 0 ev_charging 0 rv_dumps 0 propane 0 x1_mile_pop 0 x1_mile_emp 0 x1_mile_income 0 x1_2_mile_pop 0 x1_2_mile_emp 0 x1_2_mile_income 0 x5_min_pop 0 x5_min_emp 0 x5_min_inc 0 x7_min_pop 0 x7_min_emp 0 x7_min_inc 0 traditional_forecourt_fueling_positions 0 traditional_forecourt_layout 0 traditional_forecourt_stack_type 0 rv_lanes_fueling_positions 0 rv_lanes_layout 0 rv_lanes_stack_type 0 hi_flow_lanes_fueling_positions 0 hi_flow_lanes_layout 0 hi_flow_lanes_stack_type 0 hi_flow_lanes_fueling_positions_2 0 rv_lanes_fueling_positions_2 0 hi_flow_rv_lanes_layout 0 hi_flow_rv_lanes_stack_type 0 non_24_hour 0 self_check_out 0 mens_toilet_count 0 mens_urinal_count 0 womens_toilet_count 0 womens_sink_count 0 site_id_msba 0 dtype: int64
unique_counts = df_qdm.nunique()
# Sort the unique counts in descending order
unique_counts = unique_counts.sort_values(ascending=False)
# Print the number of unique values for each column in descending order
print(unique_counts)
Unnamed: 0 37 x1_mile_pop 37 x7_min_inc 37 x7_min_emp 37 x7_min_pop 37 x5_min_emp 37 x5_min_pop 37 x1_mile_emp 37 x1_2_mile_emp 37 site_id_msba 37 x5_min_inc 36 x1_mile_income 36 x1_2_mile_pop 35 x1_2_mile_income 33 parking_spaces 21 square_feet 9 hi_flow_lanes_fueling_positions_2 6 hi_flow_lanes_fueling_positions 6 mens_toilet_count 6 womens_toilet_count 6 traditional_forecourt_fueling_positions 6 rv_lanes_fueling_positions_2 5 rv_lanes_fueling_positions 5 mens_urinal_count 5 womens_sink_count 5 hi_flow_rv_lanes_layout 4 hi_flow_lanes_layout 3 rv_lanes_layout 3 traditional_forecourt_stack_type 3 hi_flow_lanes_stack_type 2 rv_lanes_stack_type 2 hi_flow_rv_lanes_stack_type 2 traditional_forecourt_layout 2 pizza 2 cat_scales 2 cinnabon 2 ethanol_free 2 hi_flow_lanes 2 rv_lanes 2 hi_flow_rv_lanes 2 def 2 years_since_last_project 2 rv_dumps 2 propane 2 bonfire_grill 2 freal 2 open_year 2 lottery 2 ev_charging 1 car_wash 1 non_24_hour 1 self_check_out 1 diesel 1 godfather_s_pizza 1 front_door_count 1 dtype: int64
from scipy import stats
# Select the columns to check for outliers (e.g., numerical columns)
numerical_columns = df_qdm.select_dtypes(include=['int64', 'float64'])
# Define a threshold for outliers
threshold = 3
# Create a DataFrame to store the z-scores
z_scores_df = (numerical_columns - numerical_columns.mean()) / numerical_columns.std()
# Calculate the number of outliers in each column
outliers_count = (z_scores_df.abs() > threshold).sum()
# Display the number of outliers in each column
print("Number of outliers in each column:")
print(outliers_count)
Number of outliers in each column: Unnamed: 0 0 open_year 0 square_feet 2 front_door_count 0 years_since_last_project 0 parking_spaces 0 x1_mile_pop 0 x1_mile_emp 1 x1_mile_income 0 x1_2_mile_pop 0 x1_2_mile_emp 1 x1_2_mile_income 0 x5_min_pop 1 x5_min_emp 0 x5_min_inc 0 x7_min_pop 1 x7_min_emp 2 x7_min_inc 0 traditional_forecourt_fueling_positions 0 rv_lanes_fueling_positions 0 hi_flow_lanes_fueling_positions 0 hi_flow_lanes_fueling_positions_2 0 rv_lanes_fueling_positions_2 0 mens_toilet_count 0 mens_urinal_count 1 womens_toilet_count 1 womens_sink_count 1 site_id_msba 0 dtype: int64
unique_site_id_count = df_tsdm['site_id_msba'].nunique()
# Print the count of unique site IDs
print("Count of unique site IDs:", unique_site_id_count)
# Get the unique site IDs
unique_site_ids = df_tsdm['site_id_msba'].unique()
# Print the unique site IDs
print(unique_site_ids)
Count of unique site IDs: 38 [24535 24255 24220 24150 23905 23835 23765 23730 23660 23555 23485 23450 23415 23380 23345 23240 23135 23065 22925 22890 22855 22820 22785 22750 22715 22680 22645 22575 22540 22505 22400 22330 22260 22120 22085 22015 21980 21560]
df_qdm.describe()
| Unnamed: 0 | open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | x1_mile_pop | x1_mile_emp | x1_mile_income | x1_2_mile_pop | ... | traditional_forecourt_fueling_positions | rv_lanes_fueling_positions | hi_flow_lanes_fueling_positions | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37.000000 | 37.000000 | 37.00000 | 37.0 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | ... | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 |
| mean | 19.000000 | 2021.324324 | 4970.27027 | 2.0 | 1.648649 | 37.405405 | 6703.567568 | 4757.648649 | 53300.378378 | 1833.108108 | ... | 14.270270 | 2.513514 | 3.324324 | 3.324324 | 2.513514 | 2.378378 | 2.351351 | 4.648649 | 1.702703 | 23040.405405 |
| std | 10.824355 | 0.474579 | 575.93121 | 0.0 | 0.483978 | 5.918237 | 5694.011350 | 4697.168291 | 24333.027254 | 1915.140476 | ... | 3.948619 | 2.049683 | 2.925501 | 2.925501 | 2.049683 | 0.923500 | 0.856875 | 1.751447 | 0.740303 | 730.069801 |
| min | 1.000000 | 2021.000000 | 2933.00000 | 2.0 | 1.000000 | 23.000000 | 0.000000 | 56.000000 | 0.000000 | 0.000000 | ... | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 21560.000000 |
| 25% | 10.000000 | 2021.000000 | 5046.00000 | 2.0 | 1.000000 | 34.000000 | 1984.000000 | 1771.000000 | 39538.000000 | 262.000000 | ... | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 4.000000 | 1.000000 | 22540.000000 |
| 50% | 19.000000 | 2021.000000 | 5046.00000 | 2.0 | 2.000000 | 38.000000 | 5574.000000 | 3895.000000 | 46356.000000 | 1003.000000 | ... | 12.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 2.000000 | 2.000000 | 4.000000 | 2.000000 | 22890.000000 |
| 75% | 28.000000 | 2022.000000 | 5046.00000 | 2.0 | 2.000000 | 41.000000 | 11269.000000 | 6002.000000 | 73519.000000 | 2686.000000 | ... | 16.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 3.000000 | 3.000000 | 6.000000 | 2.000000 | 23555.000000 |
| max | 37.000000 | 2022.000000 | 6134.00000 | 2.0 | 2.000000 | 49.000000 | 18692.000000 | 26077.000000 | 110957.000000 | 5923.000000 | ... | 24.000000 | 6.000000 | 9.000000 | 9.000000 | 6.000000 | 5.000000 | 5.000000 | 10.000000 | 4.000000 | 24535.000000 |
8 rows × 28 columns
df_tsdm.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 13908 entries, 0 to 13907 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 13908 non-null int64 1 capital_projects.soft_opening_date 13908 non-null object 2 calendar.calendar_day_date 13908 non-null datetime64[ns] 3 calendar.fiscal_week_id_for_year 13908 non-null int64 4 calendar.day_of_week 13908 non-null int64 5 calendar_information.holiday 13908 non-null object 6 calendar_information.type_of_day 13908 non-null object 7 daily_yoy_ndt.total_inside_sales 13908 non-null float64 8 daily_yoy_ndt.total_food_service 13908 non-null float64 9 diesel 13908 non-null float64 10 unleaded 13908 non-null float64 11 site_id_msba 13908 non-null int64 12 total_daily_sales 13908 non-null float64 13 total_daily_fuel_sales 13908 non-null float64 14 Date 13908 non-null datetime64[ns] 15 Year 13908 non-null int64 16 Month 13908 non-null object 17 Fiscal_Year 13908 non-null int64 18 Fiscal_Week 13908 non-null object 19 year 13908 non-null int64 20 month 13908 non-null int64 21 day 13908 non-null int64 22 quarter 13908 non-null int64 23 day_of_week 13908 non-null int64 24 is_weekend 13908 non-null int64 dtypes: datetime64[ns](2), float64(6), int64(12), object(5) memory usage: 2.7+ MB
# Separate variables into categorical and continuous
categorical_vars = ['lottery', 'freal', 'bonfire_grill', 'pizza', 'cinnabon', 'ethanol_free', 'hi_flow_lanes', 'rv_lanes', 'cat_scales', 'rv_dumps', 'propane', 'traditional_forecourt_layout', 'traditional_forecourt_stack_type', 'rv_lanes_layout', 'rv_lanes_stack_type', 'hi_flow_lanes_layout', 'hi_flow_rv_lanes_layout'] # List to store categorical variable names
continuous_vars = ['open_year', 'years_since_last_project', 'lottery','square_feet', 'parking_spaces', 'x1_mile_pop', 'x1_mile_emp', 'x1_mile_income', 'x1_2_mile_pop', 'x1_2_mile_emp', 'x1_2_mile_income', 'x5_min_pop', 'x5_min_emp', 'x5_min_inc', 'x7_min_pop', 'x7_min_emp', 'x7_min_inc', 'traditional_forecourt_fueling_positions', 'rv_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions', 'mens_toilet_count', 'mens_urinal_count', 'womens_toilet_count', 'womens_sink_count', 'site_id_msba'] # List to store continuous variable names
# Print the lists of categorical and continuous variables
print("Categorical Variables:")
print(categorical_vars)
print("\nContinuous Variables:")
print(continuous_vars)
Categorical Variables: ['lottery', 'freal', 'bonfire_grill', 'pizza', 'cinnabon', 'ethanol_free', 'hi_flow_lanes', 'rv_lanes', 'cat_scales', 'rv_dumps', 'propane', 'traditional_forecourt_layout', 'traditional_forecourt_stack_type', 'rv_lanes_layout', 'rv_lanes_stack_type', 'hi_flow_lanes_layout', 'hi_flow_rv_lanes_layout'] Continuous Variables: ['open_year', 'years_since_last_project', 'lottery', 'square_feet', 'parking_spaces', 'x1_mile_pop', 'x1_mile_emp', 'x1_mile_income', 'x1_2_mile_pop', 'x1_2_mile_emp', 'x1_2_mile_income', 'x5_min_pop', 'x5_min_emp', 'x5_min_inc', 'x7_min_pop', 'x7_min_emp', 'x7_min_inc', 'traditional_forecourt_fueling_positions', 'rv_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions', 'mens_toilet_count', 'mens_urinal_count', 'womens_toilet_count', 'womens_sink_count', 'site_id_msba']
df_qdm['open_year'].value_counts()
2021 25 2022 12 Name: open_year, dtype: int64
num_columns = df_qdm.shape[1]
print("Number of columns in df_qdm:", num_columns)
Number of columns in df_qdm: 55
df_qdm.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 37 entries, 0 to 36 Data columns (total 55 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 37 non-null int64 1 open_year 37 non-null int64 2 square_feet 37 non-null int64 3 front_door_count 37 non-null int64 4 years_since_last_project 37 non-null int64 5 parking_spaces 37 non-null int64 6 lottery 37 non-null object 7 freal 37 non-null object 8 bonfire_grill 37 non-null object 9 pizza 37 non-null object 10 cinnabon 37 non-null object 11 godfather_s_pizza 37 non-null object 12 ethanol_free 37 non-null object 13 diesel 37 non-null object 14 hi_flow_lanes 37 non-null object 15 rv_lanes 37 non-null object 16 hi_flow_rv_lanes 37 non-null object 17 def 37 non-null object 18 cat_scales 37 non-null object 19 car_wash 37 non-null object 20 ev_charging 37 non-null object 21 rv_dumps 37 non-null object 22 propane 37 non-null object 23 x1_mile_pop 37 non-null int64 24 x1_mile_emp 37 non-null int64 25 x1_mile_income 37 non-null int64 26 x1_2_mile_pop 37 non-null int64 27 x1_2_mile_emp 37 non-null int64 28 x1_2_mile_income 37 non-null int64 29 x5_min_pop 37 non-null int64 30 x5_min_emp 37 non-null int64 31 x5_min_inc 37 non-null int64 32 x7_min_pop 37 non-null int64 33 x7_min_emp 37 non-null int64 34 x7_min_inc 37 non-null int64 35 traditional_forecourt_fueling_positions 37 non-null int64 36 traditional_forecourt_layout 37 non-null object 37 traditional_forecourt_stack_type 37 non-null object 38 rv_lanes_fueling_positions 37 non-null int64 39 rv_lanes_layout 37 non-null object 40 rv_lanes_stack_type 37 non-null object 41 hi_flow_lanes_fueling_positions 37 non-null int64 42 hi_flow_lanes_layout 37 non-null object 43 hi_flow_lanes_stack_type 37 non-null object 44 hi_flow_lanes_fueling_positions_2 37 non-null int64 45 rv_lanes_fueling_positions_2 37 non-null int64 46 hi_flow_rv_lanes_layout 37 non-null object 47 hi_flow_rv_lanes_stack_type 37 non-null object 48 non_24_hour 37 non-null object 49 self_check_out 37 non-null object 50 mens_toilet_count 37 non-null int64 51 mens_urinal_count 37 non-null int64 52 womens_toilet_count 37 non-null int64 53 womens_sink_count 37 non-null int64 54 site_id_msba 37 non-null int64 dtypes: int64(28), object(27) memory usage: 16.0+ KB
df_qdm.describe()
| Unnamed: 0 | open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | x1_mile_pop | x1_mile_emp | x1_mile_income | x1_2_mile_pop | ... | traditional_forecourt_fueling_positions | rv_lanes_fueling_positions | hi_flow_lanes_fueling_positions | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37.000000 | 37.000000 | 37.00000 | 37.0 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | ... | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 |
| mean | 19.000000 | 2021.324324 | 4970.27027 | 2.0 | 1.648649 | 37.405405 | 6703.567568 | 4757.648649 | 53300.378378 | 1833.108108 | ... | 14.270270 | 2.513514 | 3.324324 | 3.324324 | 2.513514 | 2.378378 | 2.351351 | 4.648649 | 1.702703 | 23040.405405 |
| std | 10.824355 | 0.474579 | 575.93121 | 0.0 | 0.483978 | 5.918237 | 5694.011350 | 4697.168291 | 24333.027254 | 1915.140476 | ... | 3.948619 | 2.049683 | 2.925501 | 2.925501 | 2.049683 | 0.923500 | 0.856875 | 1.751447 | 0.740303 | 730.069801 |
| min | 1.000000 | 2021.000000 | 2933.00000 | 2.0 | 1.000000 | 23.000000 | 0.000000 | 56.000000 | 0.000000 | 0.000000 | ... | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 21560.000000 |
| 25% | 10.000000 | 2021.000000 | 5046.00000 | 2.0 | 1.000000 | 34.000000 | 1984.000000 | 1771.000000 | 39538.000000 | 262.000000 | ... | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 4.000000 | 1.000000 | 22540.000000 |
| 50% | 19.000000 | 2021.000000 | 5046.00000 | 2.0 | 2.000000 | 38.000000 | 5574.000000 | 3895.000000 | 46356.000000 | 1003.000000 | ... | 12.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 2.000000 | 2.000000 | 4.000000 | 2.000000 | 22890.000000 |
| 75% | 28.000000 | 2022.000000 | 5046.00000 | 2.0 | 2.000000 | 41.000000 | 11269.000000 | 6002.000000 | 73519.000000 | 2686.000000 | ... | 16.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 3.000000 | 3.000000 | 6.000000 | 2.000000 | 23555.000000 |
| max | 37.000000 | 2022.000000 | 6134.00000 | 2.0 | 2.000000 | 49.000000 | 18692.000000 | 26077.000000 | 110957.000000 | 5923.000000 | ... | 24.000000 | 6.000000 | 9.000000 | 9.000000 | 6.000000 | 5.000000 | 5.000000 | 10.000000 | 4.000000 | 24535.000000 |
8 rows × 28 columns
# Group the data by 'site_id' and find the maximum and minimum square footage
max_square_footage = df_qdm.groupby('site_id_msba')['square_feet'].max()
min_square_footage = df_qdm.groupby('site_id_msba')['square_feet'].min()
# Find the site_id with the highest and lowest square footage
site_id_with_max_square_footage = max_square_footage.idxmax()
site_id_with_min_square_footage = min_square_footage.idxmin()
# Get the actual values of square footage
highest_square_footage = max_square_footage.max()
lowest_square_footage = min_square_footage.min()
# Print the results
print(f"Site ID with the highest square footage: {site_id_with_max_square_footage}")
print(f"Highest square footage: {highest_square_footage} square feet")
print(f"Site ID with the lowest square footage: {site_id_with_min_square_footage}")
print(f"Lowest square footage: {lowest_square_footage} square feet")
Site ID with the highest square footage: 22890 Highest square footage: 6134 square feet Site ID with the lowest square footage: 24150 Lowest square footage: 2933 square feet
# Sort the DataFrame by 'square_feet' in ascending order
df_qdm_sorted = df_qdm.sort_values(by='square_feet')
# Create a color palette with a different color for each site_id
color_palette = plt.cm.get_cmap('tab20', len(df_qdm_sorted))
# Create a bar chart of square footage in ascending order with different colors
plt.figure(figsize=(12, 6)) # Adjust the figure size as needed
# Use 'site_id_msba' as the x-axis and 'square_feet' as the y-axis
# Specify the color for each bar based on the color_palette
df_qdm_sorted.plot(kind='bar', x='site_id_msba', y='square_feet', color=[color_palette(i) for i in range(len(df_qdm_sorted))])
plt.xlabel('Site ID')
plt.ylabel('Square Footage')
plt.title('Square Footage by Site ID (Ascending Order)')
plt.xticks(rotation=90) # Rotate x-axis labels for better visibility
plt.tight_layout()
# Show the bar chart
plt.show()
<ipython-input-56-bb2319f1d611>:5: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
<Figure size 1200x600 with 0 Axes>
# Group the data by 'site_id_msba' and calculate the total parking spaces for each site
site_parking_spaces = df_qdm.groupby('site_id_msba')['parking_spaces'].sum()
# Create a figure with a size that accommodates all site IDs
plt.figure(figsize=(16, 6)) # Adjust the figure size as needed
# Create an array of x-values for each site ID
x_values = np.arange(len(site_parking_spaces))
# Generate different colors for the bubbles using the 'tab20' colormap
colors = plt.cm.tab20(np.linspace(0, 1, len(site_parking_spaces)))
# Scatter plot with site IDs, parking spaces, and customized bubble sizes and colors
plt.scatter(x_values, # X-values for site IDs
site_parking_spaces.values, # Bubble sizes based on parking spaces
s=site_parking_spaces.values * 10, # Customize bubble sizes (multiplied by 3 for larger size)
c=colors, # Use the generated colors
alpha=0.7, # Transparency
label='Parking Spaces')
# Set x-axis labels to be the site IDs
plt.xticks(x_values, site_parking_spaces.index, rotation=90) # Rotate x-axis labels for readability
# Add labels to the bubbles (optional)
for x, spaces in zip(x_values, site_parking_spaces.values):
plt.annotate(spaces,
(x, spaces), # Coordinates of the bubble
textcoords="offset points",
xytext=(0, 15), # Adjust the distance from the bubble
ha='center')
# Customize plot labels and title
plt.xlabel('Site ID')
plt.ylabel('Parking Spaces')
plt.title('Bubble Chart: Parking Spaces vs. Site ID')
# Show legend
plt.legend()
# Display the bubble chart with distinct colors
plt.tight_layout()
plt.show()
# Define a list of qualitative features to be plotted
qualitative_features = ['square_feet', 'parking_spaces']
# Create a subplot grid of 1 row and 2 columns, with a specific size
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
# Set the overall title for the figure
fig.suptitle('Distribution of Store square feet and parking spaces')
# Loop through each feature in the 'qualitative_features' list
for i, col in enumerate(qualitative_features):
# Create a histogram plot for each feature using Seaborn's 'histplot'
# It uses 20 bins and also plots a Kernel Density Estimation (kde)
sns.histplot(df_qdm[col], bins=20, kde=True, ax=axes[i])
# Set the title for each subplot
axes[i].set_title(f'Distribution of {col}')
# Set the x-axis label for each subplot
axes[i].set_xlabel(col)
# Set the y-axis label for each subplot
axes[i].set_ylabel('Frequency')
# Adjust layout to prevent overlap and set the rectangle dimensions for tight_layout
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
Square Feet: The distribution of the store size appears to be right-skewed, indicating that most stores are smaller in size.
Parking Spaces: This variable also shows a slight right-skewed distribution, suggesting that most stores have fewer parking spaces.
# Count the number of 'Yes' and 'No' entries in the 'lottery' column
lottery_counts = df_qdm['lottery'].value_counts()
# Create a bar chart
plt.figure(figsize=(8, 8))
ax = lottery_counts.plot(kind='bar', color=['skyblue', 'lightcoral'])
# Customize plot labels and title
plt.xlabel('Lottery Offered')
plt.ylabel('Count')
plt.title('Number of Sites Offering Lottery Tickets')
# Annotate the values on top of the bars
for i, v in enumerate(lottery_counts):
ax.text(i, v + 0.2, str(v), ha='center', va='bottom', fontsize=12)
# Show the plot
plt.tight_layout()
plt.show()
print("Categorical Variables:")
print(categorical_vars)
print("\nContinuous Variables:")
print(continuous_vars)
Categorical Variables: ['lottery', 'freal', 'bonfire_grill', 'pizza', 'cinnabon', 'ethanol_free', 'hi_flow_lanes', 'rv_lanes', 'cat_scales', 'rv_dumps', 'propane', 'traditional_forecourt_layout', 'traditional_forecourt_stack_type', 'rv_lanes_layout', 'rv_lanes_stack_type', 'hi_flow_lanes_layout', 'hi_flow_rv_lanes_layout'] Continuous Variables: ['open_year', 'years_since_last_project', 'lottery', 'square_feet', 'parking_spaces', 'x1_mile_pop', 'x1_mile_emp', 'x1_mile_income', 'x1_2_mile_pop', 'x1_2_mile_emp', 'x1_2_mile_income', 'x5_min_pop', 'x5_min_emp', 'x5_min_inc', 'x7_min_pop', 'x7_min_emp', 'x7_min_inc', 'traditional_forecourt_fueling_positions', 'rv_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions', 'mens_toilet_count', 'mens_urinal_count', 'womens_toilet_count', 'womens_sink_count', 'site_id_msba']
# Create pie charts for each categorical column
for column in categorical_vars:
if column in df_qdm.columns:
# Count the frequency of each category in the column
category_counts = df_qdm[column].value_counts()
# Plot a pie chart
plt.figure(figsize=(6, 6))
plt.pie(category_counts, labels=category_counts.index, autopct='%1.1f%%', startangle=140)
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title(f'Pie Chart for {column}')
# Show the pie chart
plt.show()
df_qdm.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 37 entries, 0 to 36 Data columns (total 55 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 37 non-null int64 1 open_year 37 non-null int64 2 square_feet 37 non-null int64 3 front_door_count 37 non-null int64 4 years_since_last_project 37 non-null int64 5 parking_spaces 37 non-null int64 6 lottery 37 non-null object 7 freal 37 non-null object 8 bonfire_grill 37 non-null object 9 pizza 37 non-null object 10 cinnabon 37 non-null object 11 godfather_s_pizza 37 non-null object 12 ethanol_free 37 non-null object 13 diesel 37 non-null object 14 hi_flow_lanes 37 non-null object 15 rv_lanes 37 non-null object 16 hi_flow_rv_lanes 37 non-null object 17 def 37 non-null object 18 cat_scales 37 non-null object 19 car_wash 37 non-null object 20 ev_charging 37 non-null object 21 rv_dumps 37 non-null object 22 propane 37 non-null object 23 x1_mile_pop 37 non-null int64 24 x1_mile_emp 37 non-null int64 25 x1_mile_income 37 non-null int64 26 x1_2_mile_pop 37 non-null int64 27 x1_2_mile_emp 37 non-null int64 28 x1_2_mile_income 37 non-null int64 29 x5_min_pop 37 non-null int64 30 x5_min_emp 37 non-null int64 31 x5_min_inc 37 non-null int64 32 x7_min_pop 37 non-null int64 33 x7_min_emp 37 non-null int64 34 x7_min_inc 37 non-null int64 35 traditional_forecourt_fueling_positions 37 non-null int64 36 traditional_forecourt_layout 37 non-null object 37 traditional_forecourt_stack_type 37 non-null object 38 rv_lanes_fueling_positions 37 non-null int64 39 rv_lanes_layout 37 non-null object 40 rv_lanes_stack_type 37 non-null object 41 hi_flow_lanes_fueling_positions 37 non-null int64 42 hi_flow_lanes_layout 37 non-null object 43 hi_flow_lanes_stack_type 37 non-null object 44 hi_flow_lanes_fueling_positions_2 37 non-null int64 45 rv_lanes_fueling_positions_2 37 non-null int64 46 hi_flow_rv_lanes_layout 37 non-null object 47 hi_flow_rv_lanes_stack_type 37 non-null object 48 non_24_hour 37 non-null object 49 self_check_out 37 non-null object 50 mens_toilet_count 37 non-null int64 51 mens_urinal_count 37 non-null int64 52 womens_toilet_count 37 non-null int64 53 womens_sink_count 37 non-null int64 54 site_id_msba 37 non-null int64 dtypes: int64(28), object(27) memory usage: 16.0+ KB
# List of population variables
population_variables = ['x1_mile_pop', 'x1_2_mile_pop', 'x5_min_pop', 'x7_min_pop']
# Loop through each population variable and create a bar plot
for variable in population_variables:
# Group the data by site_id and calculate the mean for the population variable
data_to_plot = df_qdm.groupby('site_id_msba')[variable].mean().reset_index()
# Sort the data in descending order by the population variable
data_to_plot = data_to_plot.sort_values(by=variable, ascending=False)
# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x='site_id_msba', y=variable, data=data_to_plot, palette='viridis', order=data_to_plot['site_id_msba'])
plt.title(f'Bar Plot: {variable} by site_id (Descending Order)')
plt.xlabel('site_id')
plt.ylabel(variable)
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Show the plot
plt.show()
# List of employment variables
employment_variables = ['x1_mile_emp', 'x1_2_mile_emp', 'x5_min_emp', 'x7_min_emp']
# Loop through each employment variable and create a bar plot
for variable in employment_variables:
# Group the data by site_id and calculate the mean for the employment variable
data_to_plot = df_qdm.groupby('site_id_msba')[variable].mean().reset_index()
# Sort the data in descending order by the employment variable
data_to_plot = data_to_plot.sort_values(by=variable, ascending=False)
# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x='site_id_msba', y=variable, data=data_to_plot, palette='viridis', order=data_to_plot['site_id_msba'])
plt.title(f'Bar Plot: {variable} by site_id (Descending Order)')
plt.xlabel('site_id')
plt.ylabel(variable)
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Show the plot
plt.show()
# List of median income variables
income_variables = ['x1_mile_income', 'x1_2_mile_income', 'x5_min_inc', 'x7_min_inc']
# Loop through each income variable and create a bar plot
for variable in income_variables:
# Group the data by site_id and calculate the mean for the income variable
data_to_plot = df_qdm.groupby('site_id_msba')[variable].mean().reset_index()
# Sort the data in descending order by the income variable
data_to_plot = data_to_plot.sort_values(by=variable, ascending=False)
# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x='site_id_msba', y=variable, data=data_to_plot, palette='viridis', order=data_to_plot['site_id_msba'])
plt.title(f'Bar Plot: {variable} by site_id (Descending Order)')
plt.xlabel('site_id')
plt.ylabel(variable)
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Show the plot
plt.show()
df_qdm.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 37 entries, 0 to 36 Data columns (total 55 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 37 non-null int64 1 open_year 37 non-null int64 2 square_feet 37 non-null int64 3 front_door_count 37 non-null int64 4 years_since_last_project 37 non-null int64 5 parking_spaces 37 non-null int64 6 lottery 37 non-null object 7 freal 37 non-null object 8 bonfire_grill 37 non-null object 9 pizza 37 non-null object 10 cinnabon 37 non-null object 11 godfather_s_pizza 37 non-null object 12 ethanol_free 37 non-null object 13 diesel 37 non-null object 14 hi_flow_lanes 37 non-null object 15 rv_lanes 37 non-null object 16 hi_flow_rv_lanes 37 non-null object 17 def 37 non-null object 18 cat_scales 37 non-null object 19 car_wash 37 non-null object 20 ev_charging 37 non-null object 21 rv_dumps 37 non-null object 22 propane 37 non-null object 23 x1_mile_pop 37 non-null int64 24 x1_mile_emp 37 non-null int64 25 x1_mile_income 37 non-null int64 26 x1_2_mile_pop 37 non-null int64 27 x1_2_mile_emp 37 non-null int64 28 x1_2_mile_income 37 non-null int64 29 x5_min_pop 37 non-null int64 30 x5_min_emp 37 non-null int64 31 x5_min_inc 37 non-null int64 32 x7_min_pop 37 non-null int64 33 x7_min_emp 37 non-null int64 34 x7_min_inc 37 non-null int64 35 traditional_forecourt_fueling_positions 37 non-null int64 36 traditional_forecourt_layout 37 non-null object 37 traditional_forecourt_stack_type 37 non-null object 38 rv_lanes_fueling_positions 37 non-null int64 39 rv_lanes_layout 37 non-null object 40 rv_lanes_stack_type 37 non-null object 41 hi_flow_lanes_fueling_positions 37 non-null int64 42 hi_flow_lanes_layout 37 non-null object 43 hi_flow_lanes_stack_type 37 non-null object 44 hi_flow_lanes_fueling_positions_2 37 non-null int64 45 rv_lanes_fueling_positions_2 37 non-null int64 46 hi_flow_rv_lanes_layout 37 non-null object 47 hi_flow_rv_lanes_stack_type 37 non-null object 48 non_24_hour 37 non-null object 49 self_check_out 37 non-null object 50 mens_toilet_count 37 non-null int64 51 mens_urinal_count 37 non-null int64 52 womens_toilet_count 37 non-null int64 53 womens_sink_count 37 non-null int64 54 site_id_msba 37 non-null int64 dtypes: int64(28), object(27) memory usage: 16.0+ KB
from sklearn.preprocessing import LabelEncoder
# Use the 'categorical_vars' list to identify categorical columns
# Initialize a LabelEncoder
label_encoder = LabelEncoder()
# Encode each categorical column in 'categorical_vars'
for column in categorical_vars:
if column in df_qdm.columns:
df_qdm[column] = label_encoder.fit_transform(df_qdm[column])
# The categorical columns are now encoded with numerical values
# Display the first few rows of the DataFrame to verify
print(df_qdm.head())
Unnamed: 0 open_year square_feet front_door_count \ 0 1 2021 5046 2 1 2 2021 5046 2 2 3 2021 5046 2 3 4 2021 5046 2 4 5 2021 5046 2 years_since_last_project parking_spaces lottery freal bonfire_grill \ 0 2 38 1 1 1 1 2 39 0 1 1 2 2 35 1 1 1 3 2 36 0 1 1 4 2 25 1 1 1 pizza ... rv_lanes_fueling_positions_2 hi_flow_rv_lanes_layout \ 0 0 ... 6 3 1 1 ... 4 0 2 1 ... 5 1 3 1 ... 4 0 4 0 ... 0 2 hi_flow_rv_lanes_stack_type non_24_hour self_check_out mens_toilet_count \ 0 HF/RV No Yes 2 1 HF/RV No Yes 5 2 None No Yes 3 3 HF/RV No Yes 3 4 None No Yes 0 mens_urinal_count womens_toilet_count womens_sink_count site_id_msba 0 2 6 2 21560 1 5 10 4 21980 2 2 4 1 22015 3 3 6 2 22085 4 0 0 0 22120 [5 rows x 55 columns]
merged_df = pd.merge(df_tsdm, df_qdm, on='site_id_msba', how='inner')
# Display the merged DataFrame
print(merged_df)
Unnamed: 0_x capital_projects.soft_opening_date \
0 1 6/14/2022
1 2 6/14/2022
2 3 6/14/2022
3 4 6/14/2022
4 5 6/14/2022
... ... ...
13537 13904 1/12/2021
13538 13905 1/12/2021
13539 13906 1/12/2021
13540 13907 1/12/2021
13541 13908 1/12/2021
calendar.calendar_day_date calendar.fiscal_week_id_for_year \
0 2022-06-17 25
1 2022-06-22 25
2 2022-06-23 25
3 2022-06-26 26
4 2022-06-27 26
... ... ...
13537 2021-12-28 52
13538 2022-01-01 1
13539 2022-01-04 1
13540 2022-01-07 2
13541 2022-01-11 2
calendar.day_of_week calendar_information.holiday \
0 4 NONE
1 2 NONE
2 3 NONE
3 6 NONE
4 0 NONE
... ... ...
13537 1 NONE
13538 5 New Year's Day
13539 1 NONE
13540 4 NONE
13541 1 NONE
calendar_information.type_of_day daily_yoy_ndt.total_inside_sales \
0 WEEKDAY 2168.2920
1 WEEKDAY 2051.5635
2 WEEKDAY 2257.5000
3 WEEKEND 1520.5925
4 WEEKDAY 1897.6930
... ... ...
13537 WEEKDAY 2984.0300
13538 WEEKEND 2483.4145
13539 WEEKDAY 3169.6875
13540 WEEKDAY 3559.5805
13541 WEEKDAY 3285.5305
daily_yoy_ndt.total_food_service diesel_x ... \
0 861.6930 722.7745 ...
1 808.0275 730.4850 ...
2 966.4410 895.7970 ...
3 542.3250 584.2900 ...
4 771.4525 852.2605 ...
... ... ... ...
13537 864.2795 1609.7830 ...
13538 597.8315 731.2935 ...
13539 970.4100 1851.6470 ...
13540 1072.4175 2104.9070 ...
13541 1061.8510 2337.6045 ...
hi_flow_lanes_fueling_positions_2 rv_lanes_fueling_positions_2 \
0 5 4
1 5 4
2 5 4
3 5 4
4 5 4
... ... ...
13537 4 6
13538 4 6
13539 4 6
13540 4 6
13541 4 6
hi_flow_rv_lanes_layout hi_flow_rv_lanes_stack_type non_24_hour \
0 0 HF/RV No
1 0 HF/RV No
2 0 HF/RV No
3 0 HF/RV No
4 0 HF/RV No
... ... ... ...
13537 3 HF/RV No
13538 3 HF/RV No
13539 3 HF/RV No
13540 3 HF/RV No
13541 3 HF/RV No
self_check_out mens_toilet_count mens_urinal_count \
0 Yes 1 1
1 Yes 1 1
2 Yes 1 1
3 Yes 1 1
4 Yes 1 1
... ... ... ...
13537 Yes 2 2
13538 Yes 2 2
13539 Yes 2 2
13540 Yes 2 2
13541 Yes 2 2
womens_toilet_count womens_sink_count
0 2 2
1 2 2
2 2 2
3 2 2
4 2 2
... ... ...
13537 6 2
13538 6 2
13539 6 2
13540 6 2
13541 6 2
[13542 rows x 79 columns]
merged_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 13542 entries, 0 to 13541 Data columns (total 79 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0_x 13542 non-null int64 1 capital_projects.soft_opening_date 13542 non-null object 2 calendar.calendar_day_date 13542 non-null datetime64[ns] 3 calendar.fiscal_week_id_for_year 13542 non-null int64 4 calendar.day_of_week 13542 non-null int64 5 calendar_information.holiday 13542 non-null object 6 calendar_information.type_of_day 13542 non-null object 7 daily_yoy_ndt.total_inside_sales 13542 non-null float64 8 daily_yoy_ndt.total_food_service 13542 non-null float64 9 diesel_x 13542 non-null float64 10 unleaded 13542 non-null float64 11 site_id_msba 13542 non-null int64 12 total_daily_sales 13542 non-null float64 13 total_daily_fuel_sales 13542 non-null float64 14 Date 13542 non-null datetime64[ns] 15 Year 13542 non-null int64 16 Month 13542 non-null object 17 Fiscal_Year 13542 non-null int64 18 Fiscal_Week 13542 non-null object 19 year 13542 non-null int64 20 month 13542 non-null int64 21 day 13542 non-null int64 22 quarter 13542 non-null int64 23 day_of_week 13542 non-null int64 24 is_weekend 13542 non-null int64 25 Unnamed: 0_y 13542 non-null int64 26 open_year 13542 non-null int64 27 square_feet 13542 non-null int64 28 front_door_count 13542 non-null int64 29 years_since_last_project 13542 non-null int64 30 parking_spaces 13542 non-null int64 31 lottery 13542 non-null int64 32 freal 13542 non-null int64 33 bonfire_grill 13542 non-null int64 34 pizza 13542 non-null int64 35 cinnabon 13542 non-null int64 36 godfather_s_pizza 13542 non-null object 37 ethanol_free 13542 non-null int64 38 diesel_y 13542 non-null object 39 hi_flow_lanes 13542 non-null int64 40 rv_lanes 13542 non-null int64 41 hi_flow_rv_lanes 13542 non-null object 42 def 13542 non-null object 43 cat_scales 13542 non-null int64 44 car_wash 13542 non-null object 45 ev_charging 13542 non-null object 46 rv_dumps 13542 non-null int64 47 propane 13542 non-null int64 48 x1_mile_pop 13542 non-null int64 49 x1_mile_emp 13542 non-null int64 50 x1_mile_income 13542 non-null int64 51 x1_2_mile_pop 13542 non-null int64 52 x1_2_mile_emp 13542 non-null int64 53 x1_2_mile_income 13542 non-null int64 54 x5_min_pop 13542 non-null int64 55 x5_min_emp 13542 non-null int64 56 x5_min_inc 13542 non-null int64 57 x7_min_pop 13542 non-null int64 58 x7_min_emp 13542 non-null int64 59 x7_min_inc 13542 non-null int64 60 traditional_forecourt_fueling_positions 13542 non-null int64 61 traditional_forecourt_layout 13542 non-null int64 62 traditional_forecourt_stack_type 13542 non-null int64 63 rv_lanes_fueling_positions 13542 non-null int64 64 rv_lanes_layout 13542 non-null int64 65 rv_lanes_stack_type 13542 non-null int64 66 hi_flow_lanes_fueling_positions 13542 non-null int64 67 hi_flow_lanes_layout 13542 non-null int64 68 hi_flow_lanes_stack_type 13542 non-null object 69 hi_flow_lanes_fueling_positions_2 13542 non-null int64 70 rv_lanes_fueling_positions_2 13542 non-null int64 71 hi_flow_rv_lanes_layout 13542 non-null int64 72 hi_flow_rv_lanes_stack_type 13542 non-null object 73 non_24_hour 13542 non-null object 74 self_check_out 13542 non-null object 75 mens_toilet_count 13542 non-null int64 76 mens_urinal_count 13542 non-null int64 77 womens_toilet_count 13542 non-null int64 78 womens_sink_count 13542 non-null int64 dtypes: datetime64[ns](2), float64(6), int64(56), object(15) memory usage: 8.3+ MB
import pandas as pd
# Define the target variables
target_variables = [
'daily_yoy_ndt.total_inside_sales',
'daily_yoy_ndt.total_food_service',
'diesel_x',
'unleaded'
]
# Variables to exclude
excluded_variables = ['total_daily_sales', 'total_daily_fuel_sales','Unnamed: 0_y','Unnamed: 0_x']
# Calculate the correlation matrix for the merged DataFrame
correlation_matrix = merged_df.corr()
# Create a dictionary to store the top 10 correlated variables for each target variable
top_correlated_variables = {}
# Iterate through each target variable
for target_variable in target_variables:
# Get the absolute correlations between the target variable and all other variables
correlations = correlation_matrix[target_variable].abs()
# Exclude specified variables
correlations = correlations.drop(excluded_variables, errors='ignore')
# Sort the correlations in descending order and select the top 10 (excluding the target variable itself)
top_correlations = correlations.drop(target_variable).nlargest(10)
# Store the top correlated variables in the dictionary
top_correlated_variables[target_variable] = top_correlations.index.tolist()
# Print the top 10 correlated variables for each target variable
for target_variable, correlated_variables in top_correlated_variables.items():
print(f"Top 10 correlated variables with '{target_variable}':")
print(correlated_variables)
print()
Top 10 correlated variables with 'daily_yoy_ndt.total_inside_sales': ['daily_yoy_ndt.total_food_service', 'diesel_x', 'pizza', 'rv_lanes', 'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions_2', 'rv_lanes_fueling_positions', 'rv_lanes_fueling_positions_2', 'hi_flow_lanes', 'rv_lanes_stack_type'] Top 10 correlated variables with 'daily_yoy_ndt.total_food_service': ['daily_yoy_ndt.total_inside_sales', 'diesel_x', 'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions_2', 'hi_flow_lanes', 'rv_lanes_stack_type', 'rv_lanes', 'rv_lanes_fueling_positions', 'rv_lanes_fueling_positions_2', 'rv_lanes_layout'] Top 10 correlated variables with 'diesel_x': ['hi_flow_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions_2', 'mens_urinal_count', 'mens_toilet_count', 'womens_sink_count', 'daily_yoy_ndt.total_food_service', 'hi_flow_lanes', 'rv_lanes_stack_type', 'cat_scales', 'rv_lanes'] Top 10 correlated variables with 'unleaded': ['lottery', 'daily_yoy_ndt.total_inside_sales', 'x7_min_emp', 'open_year', 'x5_min_emp', 'x7_min_pop', 'years_since_last_project', 'site_id_msba', 'daily_yoy_ndt.total_food_service', 'x5_min_pop']
<ipython-input-70-862f048f3726>:15: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
Top 10 correlation variables with Total inside sales
# Define the top 10 correlated variables
top_correlated_vars = [
'daily_yoy_ndt.total_food_service', 'diesel_x', 'pizza', 'rv_lanes',
'rv_lanes_stack_type', 'rv_lanes_layout', 'hi_flow_rv_lanes_layout',
'hi_flow_lanes', 'rv_lanes_fueling_positions', 'hi_flow_lanes_layout'
]
# Extract the correlation matrix for the top correlated variables
correlation_matrix = merged_df[top_correlated_vars].corr()
# Create a correlogram (correlation matrix heatmap)
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlogram of Top Correlated Variables with daily_yoy_ndt.total_inside_sales')
plt.tight_layout()
# Show the correlogram
plt.show()
# Define the top 10 correlated variables
top_correlated_vars = [
'daily_yoy_ndt.total_food_service', 'diesel_x', 'pizza', 'rv_lanes',
'rv_lanes_stack_type', 'rv_lanes_layout', 'hi_flow_rv_lanes_layout',
'hi_flow_lanes', 'rv_lanes_fueling_positions', 'hi_flow_lanes_layout'
]
# Create a DataFrame for the correlation values
correlation_df = merged_df.corr() # Replace with your actual correlation matrix
# Get the correlation values for 'daily_yoy_ndt.total_inside_sales' with the top 10 variables
correlation_values = correlation_df['daily_yoy_ndt.total_inside_sales'][top_correlated_vars]
# Create a bar chart for correlations
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x=top_correlated_vars, y=correlation_values, palette='viridis')
plt.title('Correlation with daily_yoy_ndt.total_inside_sales')
plt.xlabel('Variables')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
# Annotate the bars with correlation values
for index, value in enumerate(correlation_values):
bar_plot.text(index, value, f'{value:.2f}', ha='center', va='bottom', fontsize=10)
# Show the bar chart with annotations
plt.tight_layout()
plt.show()
<ipython-input-401-672d2503f3eb>:9: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
As per the correlated bar plot, the total inside sales were in strong correlation with 1. Food service sales 2. Diesel gallons sold and 3. Pizza availability.
Top 10 correlation variables with Total Food service sales
# Define the top 10 correlated variables
top_correlated_vars = [
'daily_yoy_ndt.total_inside_sales', 'diesel_x', 'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions_2',
'rv_lanes_stack_type', 'hi_flow_lanes', 'rv_lanes', 'rv_lanes_fueling_positions', 'rv_lanes_fueling_positions_2', 'rv_lanes_layout'
]
# Create a DataFrame for the correlation values
correlation_df = merged_df.corr() # Replace with your actual correlation matrix
# Get the correlation values for 'daily_yoy_ndt.total_food_service' with the top 10 variables
correlation_values = correlation_df['daily_yoy_ndt.total_food_service'][top_correlated_vars]
# Create a bar chart for correlations
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x=top_correlated_vars, y=correlation_values, palette='viridis')
plt.title('Correlation with daily_yoy_ndt.total_food_service')
plt.xlabel('Variables')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
# Annotate the bars with correlation values
for index, value in enumerate(correlation_values):
bar_plot.text(index, value, f'{value:.2f}', ha='center', va='bottom', fontsize=10)
# Show the bar chart with annotations
plt.tight_layout()
plt.show()
<ipython-input-72-de4ea397bddc>:8: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
As per the correlated bar plot, the total food service sales were in strong correlation with 1. Total Inside sales 2. Number of Diesel gallons sold 3. Availability of High flow fuel positions.
Top 10 correlation variables with Diesel
# Define the top 10 correlated variables
top_correlated_vars = [
'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_fueling_positions_2', 'mens_urinal_count', 'mens_toilet_count',
'womens_sink_count', 'daily_yoy_ndt.total_food_service', 'hi_flow_lanes', 'rv_lanes_stack_type', 'cat_scales', 'rv_lanes'
]
# Create a DataFrame for the correlation values
correlation_df = merged_df.corr() # Replace with your actual correlation matrix
# Get the correlation values for 'diesel' with the top 10 variables
correlation_values = correlation_df['diesel_x'][top_correlated_vars]
# Create a bar chart for correlations
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x=top_correlated_vars, y=correlation_values, palette='viridis')
plt.title('Correlation with diesel')
plt.xlabel('Variables')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
# Annotate the bars with correlation values
for index, value in enumerate(correlation_values):
bar_plot.text(index, value, f'{value:.2f}', ha='center', va='bottom', fontsize=10)
# Show the bar chart with annotations
plt.tight_layout()
plt.show()
<ipython-input-73-3320275aaed6>:8: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
As per the correlated bar plot, diesel gallons sold were in strong correlation with 1.Availability of High flow fuel positions.2.mens_urnial_count 3.mens_toilet_count
Top 10 correlation variables with Unleaded
# Define the top 10 correlated variables
top_correlated_vars = [
'daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'pizza', 'rv_lanes',
'rv_lanes_stack_type', 'rv_lanes_layout', 'hi_flow_rv_lanes_layout',
'hi_flow_lanes', 'rv_lanes_fueling_positions', 'hi_flow_lanes_layout'
]
# Create a DataFrame for the correlation values
correlation_df = merged_df.corr()
# Get the correlation values for 'unleaded' with the top 10 variables
correlation_values = correlation_df['unleaded'][top_correlated_vars]
# Create a bar chart for correlations
plt.figure(figsize=(10, 6))
bar_plot = sns.barplot(x=top_correlated_vars, y=correlation_values, palette='viridis')
plt.title('Correlation with unleaded')
plt.xlabel('Variables')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
# Annotate the bars with correlation values
for index, value in enumerate(correlation_values):
bar_plot.text(index, value, f'{value:.2f}', ha='center', va='bottom', fontsize=10)
# Show the bar chart with annotations
plt.tight_layout()
plt.show()
<ipython-input-404-e99c214cc2af>:9: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
As per the correlated bar plot, all non-diesel gallons/unleaded sold were in strong correlation with 1. Lottery 2. Total Inside sales 3. years since last project.
# Group the 'merged_df' DataFrame by the 'calendar_information.holiday' column
holiday_fuel_sales = merged_df.groupby('calendar_information.holiday')['total_daily_fuel_sales'].describe()
# Reset the index of the DataFrame to convert the grouped-by values into a column
holiday_fuel_sales.reset_index(inplace=True)
holiday_fuel_sales
| calendar_information.holiday | count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | All Saint's Day | 37.0 | 4410.757608 | 2779.691996 | 1145.5640 | 2892.421000 | 4071.84400 | 4851.675500 | 17498.2010 |
| 1 | Ascension | 34.0 | 4666.302015 | 3081.437077 | 1158.0450 | 3112.347875 | 4614.75525 | 5312.503875 | 19354.5240 |
| 2 | Christmas Day | 37.0 | 1551.972176 | 693.931634 | 337.1970 | 1110.791500 | 1424.07650 | 2169.359500 | 3076.4405 |
| 3 | Christmas Eve | 37.0 | 2473.872554 | 992.136641 | 843.9025 | 1757.644000 | 2262.54700 | 3024.581000 | 4894.6345 |
| 4 | Columbus Day | 37.0 | 4558.993392 | 2557.522924 | 1199.0510 | 3220.532000 | 4363.47450 | 5272.127000 | 15846.8380 |
| 5 | Easter | 36.0 | 3161.145458 | 1967.906769 | 702.0160 | 1747.592875 | 2662.85250 | 4336.787000 | 10843.1925 |
| 6 | Father's Day | 37.0 | 3341.215784 | 2249.101734 | 750.9355 | 1546.482000 | 2932.14950 | 4541.460000 | 11605.1810 |
| 7 | Flag Day | 38.0 | 4658.869618 | 3115.888701 | 1047.7915 | 3013.122000 | 4496.65825 | 5333.341125 | 19579.6755 |
| 8 | Good Friday | 36.0 | 4805.674708 | 2607.354268 | 1143.9645 | 3446.307375 | 4461.14375 | 6193.477500 | 15595.6255 |
| 9 | Halloween Day | 37.0 | 3451.179162 | 2530.643795 | 1191.6135 | 1931.660500 | 3001.17650 | 3894.352000 | 16029.1635 |
| 10 | Independence Day | 37.0 | 2913.870419 | 1846.872003 | 804.4610 | 1289.722000 | 2612.66600 | 3848.194000 | 9952.3935 |
| 11 | Labor Day | 37.0 | 3781.042338 | 2144.169470 | 1016.5645 | 2230.504500 | 3533.07150 | 4924.244500 | 10749.2175 |
| 12 | Lincoln's Birthday | 37.0 | 3080.713622 | 1489.664455 | 787.3495 | 2201.307500 | 2747.95150 | 3903.056500 | 6134.1140 |
| 13 | Martin Luther King Day | 37.0 | 3742.821108 | 1779.183035 | 1090.2045 | 2342.119500 | 3924.26650 | 4566.061500 | 8409.3590 |
| 14 | Memorial Day | 37.0 | 3649.906230 | 2263.730822 | 826.9555 | 2032.478000 | 3435.27800 | 4378.542000 | 13119.9845 |
| 15 | Mother's Day | 36.0 | 3201.183514 | 1688.735060 | 783.8460 | 1848.823375 | 3025.90050 | 4484.713625 | 7334.7645 |
| 16 | NONE | 12624.0 | 4171.412225 | 2549.105475 | 382.3365 | 2446.410750 | 3958.17100 | 5180.982625 | 23651.7295 |
| 17 | New Year's Day | 37.0 | 2048.330351 | 899.828017 | 573.9475 | 1365.843500 | 2079.25200 | 2657.298000 | 3624.8765 |
| 18 | New Year's Eve | 37.0 | 2830.532297 | 1140.178311 | 744.5970 | 2114.514500 | 2854.14150 | 3520.979000 | 5456.5700 |
| 19 | Palm Sunday | 35.0 | 3254.194300 | 2068.372616 | 696.1605 | 1900.253250 | 2742.57900 | 4459.848750 | 11501.0560 |
| 20 | President's Day | 37.0 | 4148.618203 | 2017.706565 | 1189.3490 | 2389.138500 | 4086.27800 | 5160.638000 | 10097.8010 |
| 21 | Saint Patrick's Day | 37.0 | 4274.989392 | 2320.501370 | 1018.5385 | 2855.055000 | 4198.61050 | 5137.457500 | 14241.8220 |
| 22 | Saint Valentine's Day | 37.0 | 3885.591500 | 2029.317577 | 1218.5215 | 2202.308500 | 3655.66950 | 4742.990000 | 9565.6995 |
| 23 | Thanksgiving Day | 37.0 | 2188.708446 | 1409.208831 | 704.3400 | 1378.114500 | 1927.22950 | 2499.952000 | 8747.1370 |
| 24 | Veteran's Day | 37.0 | 4590.720703 | 2643.833191 | 1062.6035 | 2881.259500 | 4456.57450 | 5209.540000 | 15532.7130 |
| 25 | Washington's Birthday | 37.0 | 3752.085608 | 2168.592030 | 1034.4495 | 2425.017000 | 3416.99750 | 4237.411500 | 11755.5165 |
#Find the holiday with the highest average total daily fuel sales
holiday_fuel_sales[holiday_fuel_sales['mean']==holiday_fuel_sales['mean'].max()]
| calendar_information.holiday | count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|
| 8 | Good Friday | 36.0 | 4805.674708 | 2607.354268 | 1143.9645 | 3446.307375 | 4461.14375 | 6193.4775 | 15595.6255 |
#Find the holiday with the lowest average total daily fuel sales
holiday_fuel_sales[holiday_fuel_sales['mean']==holiday_fuel_sales['mean'].min()]
| calendar_information.holiday | count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|
| 2 | Christmas Day | 37.0 | 1551.972176 | 693.931634 | 337.197 | 1110.7915 | 1424.0765 | 2169.3595 | 3076.4405 |
The holiday with the highest average total daily fuel sales is 'Good Friday', with a mean sales value of approximately 4805.67.
'Christmas Day' has the lowest average total daily sales, with a mean value of around 1551.97.
# Create a new figure with specified size (width=20, height=10)
plt.figure(figsize=(20, 10))
# Create a boxplot wherein
# 'x' specifies the column for the x-axis ('calendar_information.holiday')
# 'y' specifies the column for the y-axis ('total_daily_fuel_sales')
sns.boxplot(data=merged_df, x='calendar_information.holiday', y='total_daily_fuel_sales')
# Set the title of the plot
plt.title('Distribution of Total Daily Fuel Sales by Holiday')
# Set the label for the x-axis
plt.xlabel('Holiday')
# Set the label for the y-axis
plt.ylabel('Total Fuel Sales (Gallons)')
# Rotate x-axis labels by 90 degrees for better visibility
plt.xticks(rotation=90)
# Display the plot
plt.show()
There is a similar variation in total daily gallons of fuel sold across different holidays. Holidays like 'Good Friday', 'Ascension', and 'Flag Day' tend to have higher median sales compared to others.
'Thanksgiving Day' and 'Christmas Day' appear to have lower median sales, which could be due to people spending more time at home.
df_qdm.describe()
| Unnamed: 0 | open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | lottery | freal | bonfire_grill | pizza | ... | hi_flow_lanes_fueling_positions | hi_flow_lanes_layout | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | hi_flow_rv_lanes_layout | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37.000000 | 37.000000 | 37.00000 | 37.0 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | ... | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 |
| mean | 19.000000 | 2021.324324 | 4970.27027 | 2.0 | 1.648649 | 37.405405 | 0.621622 | 0.972973 | 0.702703 | 0.405405 | ... | 3.324324 | 0.459459 | 3.324324 | 2.513514 | 0.864865 | 2.378378 | 2.351351 | 4.648649 | 1.702703 | 23040.405405 |
| std | 10.824355 | 0.474579 | 575.93121 | 0.0 | 0.483978 | 5.918237 | 0.491672 | 0.164399 | 0.463373 | 0.497743 | ... | 2.925501 | 0.557504 | 2.925501 | 2.049683 | 1.031777 | 0.923500 | 0.856875 | 1.751447 | 0.740303 | 730.069801 |
| min | 1.000000 | 2021.000000 | 2933.00000 | 2.0 | 1.000000 | 23.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 21560.000000 |
| 25% | 10.000000 | 2021.000000 | 5046.00000 | 2.0 | 1.000000 | 34.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 4.000000 | 1.000000 | 22540.000000 |
| 50% | 19.000000 | 2021.000000 | 5046.00000 | 2.0 | 2.000000 | 38.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | ... | 5.000000 | 0.000000 | 5.000000 | 4.000000 | 0.000000 | 2.000000 | 2.000000 | 4.000000 | 2.000000 | 22890.000000 |
| 75% | 28.000000 | 2022.000000 | 5046.00000 | 2.0 | 2.000000 | 41.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 5.000000 | 1.000000 | 5.000000 | 4.000000 | 2.000000 | 3.000000 | 3.000000 | 6.000000 | 2.000000 | 23555.000000 |
| max | 37.000000 | 2022.000000 | 6134.00000 | 2.0 | 2.000000 | 49.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 9.000000 | 2.000000 | 9.000000 | 6.000000 | 3.000000 | 5.000000 | 5.000000 | 10.000000 | 4.000000 | 24535.000000 |
8 rows × 45 columns
# Convert the 'capital_projects.soft_opening_date' column to datetime
merged_df['capital_projects.soft_opening_date'] = pd.to_datetime(merged_df['capital_projects.soft_opening_date'])
#get first year non-fuel sales ('total_daily_sales')
def get_first_year_sales(group):
opening_date = group['capital_projects.soft_opening_date'].min()
one_year_later = opening_date + pd.Timedelta(days=365)
first_year_data = group[group['calendar.calendar_day_date'] <= one_year_later]
return first_year_data['total_daily_sales'].sum()
first_year_sales = merged_df.groupby('site_id_msba').apply(get_first_year_sales)
#sort values from highest sales to lowest
first_year_sales_sorted = first_year_sales.sort_values(ascending=False)
first_year_sales_sorted.head(10)
site_id_msba 22085 2.323099e+06 22750 2.056672e+06 21980 2.000774e+06 23415 1.804503e+06 23660 1.664968e+06 23345 1.663042e+06 21560 1.569289e+06 23380 1.542020e+06 23730 1.533919e+06 24150 1.498044e+06 dtype: float64
#get sites with top 10 highest sales
top_10_stores = first_year_sales_sorted.head(10).index.tolist()
#get qualitative data for sites with top 10 highest sales
qualitative_data_top_10 = df_qdm[df_qdm['site_id_msba'].isin(top_10_stores)]
qualitative_data_top_10
| Unnamed: 0 | open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | lottery | freal | bonfire_grill | pizza | ... | rv_lanes_fueling_positions_2 | hi_flow_rv_lanes_layout | hi_flow_rv_lanes_stack_type | non_24_hour | self_check_out | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021 | 5046 | 2 | 2 | 38 | 1 | 1 | 1 | 0 | ... | 6 | 3 | HF/RV | No | Yes | 2 | 2 | 6 | 2 | 21560 |
| 1 | 2 | 2021 | 5046 | 2 | 2 | 39 | 0 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 5 | 5 | 10 | 4 | 21980 |
| 3 | 4 | 2021 | 5046 | 2 | 2 | 36 | 0 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 22085 |
| 14 | 15 | 2021 | 5046 | 2 | 2 | 34 | 0 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 2 | 2 | 4 | 1 | 22750 |
| 22 | 23 | 2021 | 5046 | 2 | 2 | 29 | 1 | 1 | 1 | 1 | ... | 2 | 0 | HF/RV | No | Yes | 2 | 2 | 4 | 1 | 23345 |
| 23 | 24 | 2021 | 5046 | 2 | 2 | 34 | 0 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 23380 |
| 24 | 25 | 2022 | 2940 | 2 | 1 | 23 | 0 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 2 | 2 | 6 | 3 | 23415 |
| 28 | 29 | 2022 | 5046 | 2 | 1 | 28 | 0 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 23660 |
| 29 | 30 | 2021 | 5046 | 2 | 2 | 37 | 0 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 23730 |
| 33 | 34 | 2022 | 2933 | 2 | 1 | 30 | 1 | 1 | 1 | 0 | ... | 0 | 2 | None | No | Yes | 2 | 2 | 2 | 2 | 24150 |
10 rows × 55 columns
#count unique values for 'open year'
qualitative_data_top_10['open_year'].value_counts()
2021 7 2022 3 Name: open_year, dtype: int64
#count unique values for 'parking_spaces'
qualitative_data_top_10['parking_spaces'].value_counts()
34 2 38 1 39 1 36 1 29 1 23 1 28 1 37 1 30 1 Name: parking_spaces, dtype: int64
#count unique values for 'square_feet'
qualitative_data_top_10['square_feet'].value_counts()
5046 8 2940 1 2933 1 Name: square_feet, dtype: int64
Top 10 stores with highest sales (USD)
Open Year: All these stores opened either in 2021 or 2022.
Square Feet: Most stores have a square footage of around 5046, except for two.
Parking Spaces: The number of parking spaces varies but generally ranges from 23 to 39.
Amenities: Most stores offer amenities like lottery, freal, and a bonfire grill.
#get first year fuel sales ('total_daily_fuel_sales')
def get_first_year_fuel_sales(group):
opening_date = group['capital_projects.soft_opening_date'].min()
one_year_later = opening_date + pd.Timedelta(days=365)
first_year_data = group[group['calendar.calendar_day_date'] <= one_year_later]
return first_year_data['total_daily_fuel_sales'].sum()
first_year_fuel_sales = merged_df.groupby('site_id_msba').apply(get_first_year_fuel_sales)
#sort values from highest fuel sales to lowest
first_year_fuel_sales_sorted = first_year_fuel_sales.sort_values(ascending=False)
first_year_fuel_sales_sorted.head(10)
site_id_msba 21980 5.018284e+06 22715 2.548204e+06 23660 2.460152e+06 22820 2.162210e+06 22260 2.053452e+06 22890 1.973883e+06 22575 1.874454e+06 23380 1.836589e+06 22785 1.768989e+06 23240 1.753365e+06 dtype: float64
#get sites with top 10 highest fuel sales
top_10_stores_fuel = first_year_fuel_sales_sorted.head(10).index.tolist()
#get qualitative data for sites with top 10 highest fuel sales
fuel_qdata_top_10 = df_qdm[df_qdm['site_id_msba'].isin(top_10_stores_fuel)]
fuel_qdata_top_10
| Unnamed: 0 | open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | lottery | freal | bonfire_grill | pizza | ... | rv_lanes_fueling_positions_2 | hi_flow_rv_lanes_layout | hi_flow_rv_lanes_stack_type | non_24_hour | self_check_out | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 2021 | 5046 | 2 | 2 | 39 | 0 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 5 | 5 | 10 | 4 | 21980 |
| 5 | 6 | 2021 | 5046 | 2 | 2 | 38 | 1 | 1 | 0 | 0 | ... | 0 | 2 | None | No | Yes | 4 | 2 | 4 | 2 | 22260 |
| 10 | 11 | 2021 | 5046 | 2 | 2 | 44 | 1 | 1 | 0 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 22575 |
| 13 | 14 | 2021 | 5046 | 2 | 2 | 46 | 1 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 5 | 2 | 22715 |
| 15 | 16 | 2021 | 5046 | 2 | 2 | 41 | 1 | 1 | 1 | 0 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 22785 |
| 16 | 17 | 2021 | 5046 | 2 | 2 | 36 | 1 | 0 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 2 | 2 | 4 | 1 | 22820 |
| 18 | 19 | 2022 | 6134 | 2 | 1 | 48 | 1 | 1 | 1 | 0 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 2 | 2 | 22890 |
| 21 | 22 | 2021 | 5046 | 2 | 2 | 41 | 1 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 23240 |
| 23 | 24 | 2021 | 5046 | 2 | 2 | 34 | 0 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 23380 |
| 28 | 29 | 2022 | 5046 | 2 | 1 | 28 | 0 | 1 | 1 | 1 | ... | 4 | 0 | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 23660 |
10 rows × 55 columns
#count unique values for 'open_year'
fuel_qdata_top_10['open_year'].value_counts()
2021 8 2022 2 Name: open_year, dtype: int64
#count unique values for 'parking_spaces'
fuel_qdata_top_10['parking_spaces'].value_counts()
41 2 39 1 38 1 44 1 46 1 36 1 48 1 34 1 28 1 Name: parking_spaces, dtype: int64
#count unique values for 'square_feet'
fuel_qdata_top_10['square_feet'].value_counts()
5046 9 6134 1 Name: square_feet, dtype: int64
Top 10 stores with highest fuel sales (Gallons)
Open Year: All these stores opened either in 2021 or 2022.
Square Feet: Most stores have a square footage of around 5046, except for one with 6134.
Parking Spaces: The number of parking spaces varies but generally ranges from 28 to 48.
Amenities: Most stores offer amenities like lottery, freal, and a bonfire grill.
# Merge 'df_qdm' DataFrame with 'first_year_sales_sorted' DataFrame based on 'site_id_msba' using a left-join
first_year_sales_qualitative_data = pd.merge(
df_qdm,
pd.DataFrame(first_year_sales_sorted).reset_index(),
left_on='site_id_msba',
right_on='site_id_msba',
how='left'
)
# Rename the column representing first-year sales to 'First_Year_Sales'
first_year_sales_qualitative_data.rename(columns={0: 'First_Year_Sales'}, inplace=True)
# Perform a similar merge for 'first_year_fuel_sales_sorted'
first_year_fuel_sales_qualitative_data = pd.merge(
df_qdm,
pd.DataFrame(first_year_fuel_sales_sorted).reset_index(),
left_on='site_id_msba',
right_on='site_id_msba',
how='left'
)
# Rename the column representing first-year fuel sales to 'First_Year_Sales'
first_year_fuel_sales_qualitative_data.rename(columns={0: 'First_Year_Sales'}, inplace=True)
# Create a new figure for plotting with a specified size (width=12, height=6)
plt.figure(figsize=(12, 6))
# Create the first subplot in a 1-row, 2-column grid
plt.subplot(1, 2, 1)
# Create a scatter plot for 'Number of Parking Spaces vs. First-Year Sales'
sns.scatterplot(data=first_year_sales_qualitative_data, x='parking_spaces', y='First_Year_Sales', palette='viridis')
plt.title('Number of Parking Spaces vs. First-Year Sales')
plt.xlabel('Number of Parking Spaces')
plt.ylabel('First-Year Sales (USD)')
# Create the second subplot in the same 1-row, 2-column grid
plt.subplot(1, 2, 2)
# Create a scatter plot for 'Number of Parking Spaces vs. First-Year Fuel Sales'
sns.scatterplot(data=first_year_fuel_sales_qualitative_data, x='parking_spaces', y='First_Year_Sales', palette='viridis')
plt.title('Number of Parking Spaces vs. First-Year Fuel Sales')
plt.xlabel('Number of Parking Spaces')
plt.ylabel('First-Year Fuel Sales (Gallons)')
# Display the plots
plt.show()
<ipython-input-421-ad4a74534801>:32: UserWarning: Ignoring `palette` because no `hue` variable has been assigned. <ipython-input-421-ad4a74534801>:41: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
The scatter plots shows the Number of Parking Spaces vs. First-Year Sales (Fuel and Non-Fuel sales) for the top 10 performing stores. Each point represents a store.
The number of parking spaces doesn't seem to have a strong correlation with first-year sales. The number of parking spaces varies among these stores, but their first-year sales still differ significantly.
plt.figure(figsize=(15, 10))
# Create the first subplot in a 2-row, 3-column grid
# The boxplot shows the distribution of First-Year Sales based on the presence or absence of a lottery
plt.subplot(2, 3, 1)
sns.boxplot(x='lottery', y='First_Year_Sales', data=first_year_sales_qualitative_data)
plt.title('First-Year Sales (USD) by Lottery')
# Create the second subplot in the same grid
# This boxplot focuses on the feature 'freal'
plt.subplot(2, 3, 2)
sns.boxplot(x='freal', y='First_Year_Sales', data=first_year_sales_qualitative_data)
plt.title("First-Year Sales (USD) by Freal")
# Create the third subplot in the same grid
# This boxplot focuses on the feature 'bonfire_grill'
plt.subplot(2, 3, 3)
sns.boxplot(x='bonfire_grill', y='First_Year_Sales', data=first_year_sales_qualitative_data)
plt.title('First-Year Sales (USD) by Bonfire Grill')
# Create the fourth subplot, which focuses on First-Year Fuel Sales and the feature 'lottery'
plt.subplot(2, 3, 4)
sns.boxplot(x='lottery', y='First_Year_Sales', data=first_year_fuel_sales_qualitative_data)
plt.title('First-Year Fuel Sales (Gallons) by Lottery')
# Create the fifth subplot, which focuses on First-Year Fuel Sales and the feature 'freal'
plt.subplot(2, 3, 5)
sns.boxplot(x='freal', y='First_Year_Sales', data=first_year_fuel_sales_qualitative_data)
plt.title("First-Year Fuel Sales (Gallons) by Freal")
# Create the sixth and final subplot, which focuses on First-Year Fuel Sales and the feature 'bonfire_grill'
plt.subplot(2, 3, 6)
sns.boxplot(x='bonfire_grill', y='First_Year_Sales', data=first_year_fuel_sales_qualitative_data)
plt.title('First-Year Fuel Sales (Gallons) by Bonfire Grill')
# Adjust the layout so that plots do not overlap
plt.tight_layout()
# Display the figure containing all the subplots
plt.show()
First-Year Sales by Lottery: Stores without a lottery seem to have slightly higher median sales than those with a lottery. However, the difference is not very significant, and the range of sales is wide in both categories.
First-Year Sales by Freal: Stores with Freal appear to have slightly higher median sales compared to those without. Given the box plot for 'No' we can confirm that there is only 1 store with 'No' freal as seen in the dataset.
First-Year Sales by Bonfire Grill: Stores with a Bonfire Grill feature appear to have a lower median in first-year sales compared to those without this feature.
First-Year Fuel Sales by Lottery: Stores with lottery seem to have very slight higher median fuel sales than those with a lottery. However, the difference is not very significant.
First-Year Fuel Sales by Freal: Stores with Freal appear to have slightly lower median sales compared to those without. Given the box plot for 'No' we can confirm that there is only 1 store with 'No' freal as seen in the dataset.
First-Year Fuel Sales by Bonfire Grill: Stores with a Bonfire Grill feature appear to have a slightly higher median in first-year sales compared to those without this feature.
# Calculate the correlation matrix for numerical columns in the 'first_year_sales_qualitative_data' DataFrame
correlation_matrix = first_year_sales_qualitative_data.select_dtypes(include=['number']).corr()
plt.figure(figsize=(20, 15))
# Generate a heatmap to visualize the correlation matrix with spearman coefficients
sns.heatmap(correlation_matrix[['First_Year_Sales']].sort_values(by='First_Year_Sales', ascending=False),
annot=True,
cmap='coolwarm',
vmin=-1,
vmax=1,
linewidths=.5)
# Set the title for the heatmap
plt.title('Correlation Matrix: Features vs. First-Year Sales (USD)')
# Display the heatmap
plt.show()
In this correlation heatmap we can see pizza and laning positions have a strong positive correlation and population features have a low correlation with First-Year sales (USD)
# Calculate the correlation matrix for numerical columns in the 'first_year_fuel_sales_qualitative_data' DataFrame
correlation_matrix = first_year_fuel_sales_qualitative_data.select_dtypes(include=['number']).corr()
plt.figure(figsize=(20, 15))
# Generate a heatmap to visualize the correlation matrix with spearman coefficients
sns.heatmap(correlation_matrix[['First_Year_Sales']].sort_values(by='First_Year_Sales', ascending=False),
annot=True,
cmap='coolwarm',
vmin=-1,
vmax=1,
linewidths=.5)
# Set the title for the heatmap
plt.title('Correlation Matrix: Features vs. First-Year Fuel Sales (Gallons)')
# Display the heatmap
plt.show()
In this correlation heatmap we can see toilets and laning positions have a strong positive correlation and population features and lane layouts have a low correlation with First-Year fuel sales (Gallons)
plt.figure(figsize=(14, 6))
# Create the first subplot
# Generate a histogram of 'first_year_sales_sorted' with 20 bins and a Kernel Density Estimation (kde)
plt.subplot(2, 2, 1)
sns.histplot(first_year_sales_sorted, bins=20, color='skyblue', kde=True)
plt.title('Histogram of First-Year Sales')
plt.xlabel('First-Year Sales (USD)')
plt.ylabel('Frequency')
# Create the second subplot
# Generate a boxplot of 'first_year_sales_sorted'
plt.subplot(2, 2, 2)
sns.boxplot(x=first_year_sales_sorted, color='salmon')
plt.title('Boxplot of First-Year Sales')
plt.xlabel('First-Year Sales (USD)')
# Create the third subplot
# Generate a histogram of 'first_year_fuel_sales_sorted' with 20 bins and a Kernel Density Estimation (kde)
plt.subplot(2, 2, 3)
sns.histplot(first_year_fuel_sales_sorted, bins=20, color='skyblue', kde=True)
plt.title('Histogram of First-Year Fuel Sales')
plt.xlabel('First-Year Fuel Sales (Gallons)')
plt.ylabel('Frequency')
# Create the fourth subplot
# Generate a boxplot of 'first_year_fuel_sales_sorted'
plt.subplot(2, 2, 4)
sns.boxplot(x=first_year_fuel_sales_sorted, color='salmon')
plt.title('Boxplot of First-Year Fuel Sales')
plt.xlabel('First-Year Fuel Sales (Gallons)')
plt.tight_layout()
# Display the figure containing all the subplots
plt.show()
We can see that the distribution for the First-Year Sales is right skewed meaning that the first-year sales are usually on the lower side with some outliers on the higher-end.
first_year_sales_sorted.index[0]
22085
from statsmodels.tsa.seasonal import seasonal_decompose
# Find the store ID with the highest first-year sales ('top_store_id')
top_store_id = first_year_sales_sorted.index[0]
# Filter the data for the top store from the 'df_tsdm' DataFrame
top_store_data = df_tsdm[df_tsdm['site_id_msba'] == top_store_id]
# Add a new column 'total_daily_sales' to 'top_store_data' by filtering the top store's daily sales from 'merged_df'
top_store_data['total_daily_sales'] = merged_df[merged_df['site_id_msba'] == top_store_id]['total_daily_sales'].tolist()
top_store_data = top_store_data.sort_values('calendar.calendar_day_date').reset_index(drop=True)
# Set the date column as the index and convert it to a DatetimeIndex with daily frequency
top_store_data.set_index('calendar.calendar_day_date', inplace=True)
top_store_data.index = pd.DatetimeIndex(top_store_data.index).to_period('D')
# Fill any missing dates with a frequency of 'D' (daily) and fill missing values with 0
top_store_data = top_store_data.asfreq('D').fillna(0)
top_store_data.head()
<ipython-input-427-6ad61212beb1>:10: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| Unnamed: 0 | capital_projects.soft_opening_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | ... | Year | Month | Fiscal_Year | Fiscal_Week | year | month | day | quarter | day_of_week | is_weekend | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| calendar.calendar_day_date | |||||||||||||||||||||
| 2021-02-02 | 12616 | 2021-02-02 | 5 | 1 | NONE | WEEKDAY | 3085.4005 | 1195.7400 | 765.4710 | 1044.7430 | ... | 2021 | 2021-02-02 | 2021 | 05 | 2021 | 2 | 2 | 1 | 1 | 0 |
| 2021-02-03 | 12716 | 2021-02-02 | 5 | 2 | NONE | WEEKDAY | 3567.2945 | 1357.3525 | 1419.1380 | 1460.5675 | ... | 2021 | 2021-02-03 | 2021 | 05 | 2021 | 2 | 3 | 1 | 2 | 0 |
| 2021-02-04 | 12534 | 2021-02-02 | 5 | 3 | NONE | WEEKDAY | 4141.1230 | 1606.9690 | 1111.4215 | 1723.8865 | ... | 2021 | 2021-02-04 | 2021 | 05 | 2021 | 2 | 4 | 1 | 3 | 0 |
| 2021-02-05 | 12617 | 2021-02-02 | 6 | 4 | NONE | WEEKDAY | 3795.4245 | 1455.3980 | 1292.8895 | 1662.8885 | ... | 2021 | 2021-02-05 | 2021 | 05 | 2021 | 2 | 5 | 1 | 4 | 0 |
| 2021-02-06 | 12717 | 2021-02-02 | 6 | 5 | NONE | WEEKEND | 1883.9975 | 592.1895 | 808.9200 | 1043.1820 | ... | 2021 | 2021-02-06 | 2021 | 05 | 2021 | 2 | 6 | 1 | 5 | 1 |
5 rows × 24 columns
# Perform seasonal decomposition on 'total_daily_sales' of the top-performing store
decomposition = seasonal_decompose(top_store_data['total_daily_sales'], model='additive', period=7)
plt.figure(figsize=(14, 8))
# Plot the observed (original) values
plt.subplot(411)
plt.plot(decomposition.observed.values, label='Original')
plt.legend(loc='best')
plt.title('Time Series Decomposition of Daily Sales for Top Performing Store')
# Plot the trend component
plt.subplot(412)
plt.plot(decomposition.trend.values, label='Trend')
plt.legend(loc='best')
# Plot the seasonal component
plt.subplot(413)
plt.plot(decomposition.seasonal.values, label='Seasonal')
plt.legend(loc='best')
# Plot the residual component
plt.subplot(414)
plt.plot(decomposition.resid.values, label='Residual')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
We can see a gradual upward trend in the total-daily sales. We have a weekly cycle since we chose period=7. We can see some spikes in the residual indicating some noise present which needs to be addressed.
plt.figure(figsize=(14, 7))
# Generate an autocorrelation plot for 'total_daily_sales' of the top-performing store
pd.plotting.autocorrelation_plot(top_store_data['total_daily_sales'])
# Set the title for the autocorrelation plot
plt.title('Autocorrelation Plot for the Top Store')
# Display the autocorrelation plot
plt.show()
The autocorrelation plot shown above is for the top performing store with the highest first-year sales. The plot shows how a value is correlated with its past values.
The blue region represents the confidence interval. Correlation values outside this region are statistically significant. The plot indicates some level of positive autocorrelation for lag values between 0 and approximately 20 days. Beyond 20 days, the autocorrelation values fall within the confidence interval, suggesting that sales beyond this point are not significantly correlated with the current day's sales.
from statsmodels.tsa.stattools import adfuller
# Perform the Augmented Dickey-Fuller (ADF) test on the 'total_daily_sales' of the top-performing store
result = adfuller(top_store_data['total_daily_sales'].dropna())
# Print the statistics
print('ADF Statistic:', result[0])
print('p-value:', result[1])
print('Critical Values:', result[4])
ADF Statistic: -3.0898204481906437
p-value: 0.027315750275665577
Critical Values: {'1%': -3.44911857009962, '5%': -2.8698097654570507, '10%': -2.5711757061225153}
ADF Statistic: -3.0898204481906437 The value of -3.089 suggests that the series is likely stationary.
p-value: 0.027315750275665577 The p-value represents the probability that the null hypothesis is true. The p-value is 0.0273, which is less than 0.05, reinforcing the idea that the series is stationary.
Critical Values:
1%: -3.44911857009962 5%: -2.8698097654570507 10%: -2.5711757061225153 In this case, the ADF statistic of -3.506 is less than the value of -3.449 at the 1% confidence level. This implies that we can be 99% confident that the series is stationary.
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.dates as mdates
# Get the IDs of the top 5 stores based on first-year sales
top_5_stores = first_year_sales_sorted.head(5).index.tolist()
# Filter the merged_df to only include data for the top 5 stores
top_5_stores_data = merged_df[merged_df['site_id_msba'].isin(top_5_stores)]
fig, axes = plt.subplots(5, 3, figsize=(18, 15))
# Loop through each of the top 5 stores
for i, store in enumerate(top_5_stores):
# Filter data for the current store
store_data = top_5_stores_data[top_5_stores_data['site_id_msba'] == store]
# Set the date as the DataFrame index and sort by date
store_data = store_data.set_index('calendar.calendar_day_date')
store_data.sort_index(inplace=True)
# Perform seasonal decomposition with a weekly period (7 days)
decomposition = seasonal_decompose(store_data['total_daily_sales'], period=7)
# Plot and configure the Observed component for the current store
axes[i, 0].plot(decomposition.observed)
axes[i, 0].set_title(f"Store {store} - Observed")
axes[i, 0].xaxis.set_major_locator(mdates.MonthLocator())
axes[i, 0].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
axes[i, 0].tick_params(axis='x', rotation=45)
# Plot and configure the Trend component for the current store
axes[i, 1].plot(decomposition.trend)
axes[i, 1].set_title(f"Store {store} - Trend")
axes[i, 1].xaxis.set_major_locator(mdates.MonthLocator())
axes[i, 1].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
axes[i, 1].tick_params(axis='x', rotation=45)
# Plot and configure the Residual component for the current store
axes[i, 2].plot(decomposition.resid)
axes[i, 2].set_title(f"Store {store} - Residual")
axes[i, 2].xaxis.set_major_locator(mdates.MonthLocator())
axes[i, 2].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
axes[i, 2].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
We can compare the trend, seasonality and noise in the top 5 performing stores with non-fuel sales. We can see a gradual increase in trend in most of them, with heavy spikes in residuals ins some of the stores, usually occuring during the end of year.
# Find the store ID with the highest first-year fuel sales
top_store_id_fuel = first_year_fuel_sales_sorted.index[0]
# Filter the data for the top fuel store from the 'df_tsdm' DataFrame
top_store_fuel_data = df_tsdm[df_tsdm['site_id_msba'] == top_store_id]
# Add a new column 'total_daily_sales' to 'top_store_fuel_data'
top_store_fuel_data['total_daily_sales'] = merged_df[merged_df['site_id_msba'] == top_store_id_fuel]['total_daily_fuel_sales'].tolist()
# Sort 'top_store_fuel_data' by date and reset its index
top_store_fuel_data = top_store_fuel_data.sort_values('calendar.calendar_day_date').reset_index(drop=True)
# Set the date column as the DataFrame's index
top_store_fuel_data.set_index('calendar.calendar_day_date', inplace=True)
# Convert the index to a Pandas DatetimeIndex with daily frequency
top_store_fuel_data.index = pd.DatetimeIndex(top_store_fuel_data.index).to_period('D')
# Set the DataFrame's frequency to daily and fill any missing values with 0
top_store_fuel_data = top_store_fuel_data.asfreq('D').fillna(0)
top_store_fuel_data.head()
<ipython-input-432-0541b8d47253>:8: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| Unnamed: 0 | capital_projects.soft_opening_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | ... | Year | Month | Fiscal_Year | Fiscal_Week | year | month | day | quarter | day_of_week | is_weekend | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| calendar.calendar_day_date | |||||||||||||||||||||
| 2021-02-02 | 12616 | 2021-02-02 | 5 | 1 | NONE | WEEKDAY | 3085.4005 | 1195.7400 | 765.4710 | 1044.7430 | ... | 2021 | 2021-02-02 | 2021 | 05 | 2021 | 2 | 2 | 1 | 1 | 0 |
| 2021-02-03 | 12716 | 2021-02-02 | 5 | 2 | NONE | WEEKDAY | 3567.2945 | 1357.3525 | 1419.1380 | 1460.5675 | ... | 2021 | 2021-02-03 | 2021 | 05 | 2021 | 2 | 3 | 1 | 2 | 0 |
| 2021-02-04 | 12534 | 2021-02-02 | 5 | 3 | NONE | WEEKDAY | 4141.1230 | 1606.9690 | 1111.4215 | 1723.8865 | ... | 2021 | 2021-02-04 | 2021 | 05 | 2021 | 2 | 4 | 1 | 3 | 0 |
| 2021-02-05 | 12617 | 2021-02-02 | 6 | 4 | NONE | WEEKDAY | 3795.4245 | 1455.3980 | 1292.8895 | 1662.8885 | ... | 2021 | 2021-02-05 | 2021 | 05 | 2021 | 2 | 5 | 1 | 4 | 0 |
| 2021-02-06 | 12717 | 2021-02-02 | 6 | 5 | NONE | WEEKEND | 1883.9975 | 592.1895 | 808.9200 | 1043.1820 | ... | 2021 | 2021-02-06 | 2021 | 05 | 2021 | 2 | 6 | 1 | 5 | 1 |
5 rows × 24 columns
# Perform seasonal decomposition on 'total_daily_fuel_sales' of the top-performing fuel store
decomposition = seasonal_decompose(top_store_fuel_data['total_daily_fuel_sales'], model='additive', period=7)
plt.figure(figsize=(14, 8))
# Plot the observed (original) values
plt.subplot(411)
plt.plot(decomposition.observed.values, label='Original')
plt.legend(loc='best')
plt.title('Time Series Decomposition of Daily Fuel Sales for Top Performing Store')
# Plot the trend component
plt.subplot(412)
plt.plot(decomposition.trend.values, label='Trend')
plt.legend(loc='best')
# Plot the seasonal component
plt.subplot(413)
plt.plot(decomposition.seasonal.values, label='Seasonal')
plt.legend(loc='best')
# Plot the residual component
plt.subplot(414)
plt.plot(decomposition.resid.values, label='Residual')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
The time-series decomposition for the store with highest gallons of fuel sold seems to be similar to the time-series decomposition for the store with highest sales in dollar. The trend is gradually increasing with a weekly seasonality and some spike in noise.
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.dates as mdates
# Get the IDs of the top 5 stores based on first-year fuel sales
top_5_fuel_stores = first_year_fuel_sales_sorted.head(5).index.tolist()
# Filter merged_df to only include data for the top 5 fuel-selling stores
top_5_fuel_stores_data = merged_df[merged_df['site_id_msba'].isin(top_5_fuel_stores)]
fig, axes = plt.subplots(5, 3, figsize=(18, 15))
# Loop through each of the top 5 fuel-selling stores
for i, store in enumerate(top_5_fuel_stores):
# Filter data for the current store
store_data = top_5_fuel_stores_data[top_5_fuel_stores_data['site_id_msba'] == store]
# Set the date as DataFrame index and sort by date
store_data = store_data.set_index('calendar.calendar_day_date')
store_data.sort_index(inplace=True)
# Perform seasonal decomposition with a weekly period (7 days)
decomposition = seasonal_decompose(store_data['total_daily_sales'], period=7)
# Plot and configure the Observed component for the current store
axes[i, 0].plot(decomposition.observed)
axes[i, 0].set_title(f"Store {store} - Observed")
axes[i, 0].xaxis.set_major_locator(mdates.MonthLocator())
axes[i, 0].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
axes[i, 0].tick_params(axis='x', rotation=45)
# Plot and configure the Trend component for the current store
axes[i, 1].plot(decomposition.trend)
axes[i, 1].set_title(f"Store {store} - Trend")
axes[i, 1].xaxis.set_major_locator(mdates.MonthLocator())
axes[i, 1].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
axes[i, 1].tick_params(axis='x', rotation=45)
# Plot and configure the Residual component for the current store
axes[i, 2].plot(decomposition.resid)
axes[i, 2].set_title(f"Store {store} - Residual")
axes[i, 2].xaxis.set_major_locator(mdates.MonthLocator())
axes[i, 2].xaxis.set_major_formatter(mdates.DateFormatter("%b '%y"))
axes[i, 2].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
We can compare the trend, seasonality and noise in the top 5 performing stores of fuel sales. We can see a gradual increase in trend in most of them, with heavy spikes in residuals in some of the stores, usually occuring during the end of year. The dips in trend can possibly due to the holiday season at the end-of-year when people prefer being indoors and there is a low vehicle turnout.
The qualitative dataset had some missing values in the categorical columns. We have imputed then by creating a new category 'Not present' indicating that the feature is not available at the store. Is this the best way to deal with the null values? Should we use other imputation methods like mode, median?
Most of the distributions we visualized were skewed indicating the presence of outliers. What method would be best for dealing with outliers, Winsorization or using z-score to remove outliers?
Some columns are identical in the Qualitative Dataset. What needs to be done with those columns?
Some columns are having values with no variation for all the sites. Do we need to remove those while modeling as their contribution to the analysis is insignificant ?
Qualitative data provided to 37 sites but Time series data provided to 38 sites. Will we be getting the data for 38th site for Modeling ?
Throughout our exploration, we focused on understanding how various features correlate with the sales features, to aid in forecasting first-year sales for Maverik's new stores. We began by examining the qualitative and time-series datasets separately, performing EDA that included missing value treatment, outlier detection, and basic visualizations like histograms and bar charts. Sales-related features ('daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', and 'unleaded') generally exhibited right-skewed distributions, as did store attributes like 'square_feet' and 'parking_spaces'. This skewness suggests that most stores have lower sales and smaller sizes. When examining the impact of holidays on 'total_sales', we found significant variations. For instance, 'Good Friday' showed the highest average sales, while 'New Year's Day' had the lowest. This variation in sales across holidays suggests that special events can significantly influence customer purchasing behavior. Time-series decomposition showed discernible seasonal patterns, particularly gradual increase in sales from Monday to Friday and then a dip on the Weekends. We also found a heavy dive in sales during the end of year. During 2nd,3rd quarters of Year 2022, sales were found to be highest with June-2022 is the month with highest sales.
I created visualizations to determine the revenue generating columns on our datasets by merging them together using inner merge. I also found out that there are several columns that has zero variations which might not be as useful on creating a model because they don't have discriminative power. Removing these columns can also increase computational effeciency of our model. I added questions for our group to ponder while doing this project.
I did EDA on the qualitative and time series datasets indivually observing which sites, which days of the week, and what time of the year have the highest total sales. I also added some questions at the top that I was trying to answer while doing EDA.
I worked on visualizing and interpreting the distribution of various features. I further analysed the First-Year sales and the top performing stores. I also visualized and interpreted the patterns that emerged in the analysis of First-Year sales and sales of the top performing stores. This included feature visualizations, Correlation analysis, feature and target distribution analysis, Time Series Decomposition (trend, seasonality, residuals), Autocorrelation analysis, Stationarity test and Decomposition comparison of top 5 stores. I also analyzed and visualized the effect of holidays on sales.
I worked on the EDA of Time series dataset ,Qualitative dataset and Merged datasets. I analysed the Descriptive statistics of TimeSeries dataset which included Average Fuel & Non-Fuel sales, Correlation matrix of Target Variables with other variables & Feature engineering. Contributed to Introduction,Questions & Results area.